Matching regular expression or leaving empty string when not found

This is my first post on this site, so tell me if I mess something up. I need to find the config files that belong to files of the same name; the only difference is that config file names have 'str' at the end.
The file names follow this pattern: some characters + _digit + car + some more characters + str or nothing.
All files are plain text, so the extension doesn't give any more information. The file name also contains some important information, such as the occurrence number, which I need to extract as well.
My approach using regex boils down to this:
import re
reg = '(.*(?=\\dcar))(\\d(?=car)).*(str)?'
config_to_file1 = 'wts-lg-000191_0car_lp_str'
file1 = 'wts-lg-000191_0car_lp'
print(re.findall(reg,file1))
print(re.findall(reg,config_to_file1))
I also tried this:
reg = '(.*(?=\\dcar))(\\d(?=car)).*(str)+'
I expected to get this:
[('wts-lg-000191_', '0', '')]
[('wts-lg-000191_', '0', 'str')]
But got this instead:
[('wts-lg-000191_', '0', '')]
[('wts-lg-000191_', '0', '')]
I know I'm not using the ? token properly. I tried looking around, but I don't know what I am missing. I also want to stick with the regular expression approach for practice.

The main reason your regex fails is that the .* before (str)? is greedy and grabs the rest of the string up to its end; (str)? then matches the empty string at the end-of-string position, because it is optional and does not have to consume any chars.
Your regex can also be simplified considerably, as you are overusing lookarounds. Use
reg = r'(.*?)(\d)car(?:.*(str))?'
Or
reg = r'(.*?)(\d+)car(?:.*(str))?'
See this Python demo and the regex demo.
Details
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible
(\d+) - Group 2: one or more digits (the first variant, (\d), matches exactly one digit)
car - the literal string car
(?:.*(str))? - an optional non-capturing group that matches 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
(str) - Group 3: the str substring.
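As a quick check, the sketch below runs both the pattern from the question and the suggested one against the two sample names. The variable names (original, fixed, names) are only illustrative and not part of the answer.

```python
import re

# Pattern from the question: the greedy .* before (str)? consumes the rest of the name,
# so (str)? can only match the empty string at the very end.
original = re.compile(r'(.*(?=\dcar))(\d(?=car)).*(str)?')

# Suggested pattern: (str) sits inside an optional group together with its own .*,
# so the engine backtracks inside that group until "str" is found (or gives the group up).
fixed = re.compile(r'(.*?)(\d+)car(?:.*(str))?')

names = ['wts-lg-000191_0car_lp', 'wts-lg-000191_0car_lp_str']

for name in names:
    print(name)
    print('  original:', original.findall(name))
    print('  fixed:   ', fixed.findall(name))
```

With re.findall, a group that did not take part in the match is reported as an empty string, which gives the "empty string when not found" behaviour the question asks for.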

Related

Cannot seem to figure out this regex involving forward slash

I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance capture this should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)
extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.

You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
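To see the pattern at work without a DataFrame, here is a minimal sketch that uses only the re module. The helper name extract_segment and the second sample string are assumptions made for illustration. Note that the pandas call above leaves NaN, not None, in rows that do not match.

```python
import re

# Same pattern as in the answer, compiled once
pattern = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def extract_segment(path):
    """Return the text after the fourth '/', or None if the pattern does not match."""
    m = pattern.search(path)
    return m.group(1) if m else None

samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/a/b/c/d/e',  # only one '/' after the candidate segment, so no match
]

for s in samples:
    print(repr(extract_segment(s)))
```

Because every segment is matched with [^/]*, a segment can never run across a slash, so the group always lands on the text between the fourth and fifth slash.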

An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - matches the part before capture this, with as few characters as possible between each pair of /s (a non-capturing group is used so its contents are ignored)
(.*?) - captures capture this (since this is a capturing group, its contents can be fetched with <match_object>.group(1))
(?:/.*?){2,} - same idea as the first part: at least two more /s, with as few characters as possible after each

(Python) How to check a long string against several regex?

I want to ensure that a long string can match with several regex at once.
I have a long multi line string containing a list of files and some content of the file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined file/content pairs:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these couples by matching the string against some regexes
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(str, rxs):
    res = True
    for rx in rxs:
        p = re.compile(rx, re.M)
        res = res and p.search(str)
    return res
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1, rxs):
    print("input1 matches all rxs") # expected
else:
    print("input1 does not match all rxs")
if all_matching(input2, rxs):
    print("input2 matches all rxs")
else:
    print("input2 does not match all rxs") # expected because input2 doesn't match rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
Clarified the problem and provided sample code.

You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
(?m)\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more lines (but as few as possible; another line is consumed only if the subsequent subpatterns do not match at the current one)
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
    return pat.search(s)
See full Python demo online.
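Put together, the approach can be sketched as a self-contained script using the sample data from the question. The helper name build_combined and the lines list are assumptions introduced here; the sketch also assumes that every input regex starts with ^, as the ones in the question do.

```python
import re

rxs = [r"^\S*FILE1\S*\tCONTENT11$",
       r"^\S*FILE1\S*\tCONTENT12$",
       r"^\S*FILE3\S*\tCONTENT11$"]

def build_combined(rxs):
    # Drop the leading ^ of each regex and wrap the rest in a lookahead
    # that may skip any number of whole lines first.
    lookaheads = "".join(r"(?=(?:.*\n)*?{})".format(rx[1:]) for rx in rxs)
    return re.compile(r"(?m)\A" + lookaheads)

def all_matching(s, pat):
    return pat.search(s)

lines = ["DIR1\\FILE1.EXT1\tCONTENT11",
         "DIR1\\FILE1.EXT1\tCONTENT12",
         "DIR1\\FILE1.EXT1\tCONTENT13",
         "DIR1\\FILE2.EXT1\tCONTENT21",
         "DIR2\\FILE3.EXT2\tCONTENT31",
         "DIR3\\FILE3.EXT2\tCONTENT11"]
input1 = "\n".join(lines)       # contains all three pairs
input2 = "\n".join(lines[:-1])  # the FILE3/CONTENT11 line is missing

pat = build_combined(rxs)
print(bool(all_matching(input1, pat)))
print(bool(all_matching(input2, pat)))
```

This is a single search call, but each lookahead still scans from the start of the string, so it is worth timing against the original loop on realistically sized input before adopting it.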

Regex breaking with preceding characters

I am trying to parse a phone number from a group of strings by compiling this regex:
exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
This successfully matches with a line like "+1(123)-456-7890". However, if I add anything in front of it, like "P: +1(123)-456-7890" it does not match. I tested on Regex websites but can't figure this out at all.

You might consider using re.search (which scans through the string) instead of re.match, which only tries to match at the beginning of the string. Alternatively, you could add a .* to the start of the pattern.
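The difference can be seen with the expression from the question. This is only a sketch; the names line and prefixed are made up for the illustration.

```python
import re

exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
line = "P: +1(123)-456-7890"

# match() is anchored at position 0, where "P: " gets in the way
print(exp.match(line))

# search() tries every start position and finds the number
found = exp.search(line)
print(found.group(0))

# Prefixing a greedy .* lets match() succeed, but .* swallows as much as it can,
# so the groups end up holding only the tail of the number
prefixed = re.match(r'.*' + exp.pattern, line)
print(prefixed.group(2))
```

For this particular pattern, re.search therefore looks like the safer of the two suggestions; a lazy .*? prefix would be another option.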

Your regex returns the following result with findall:
[('+1', '(123)-456-7890', '(', ')', '-', '-')]
If the format is fixed, you can use something like
phone = re.compile(r"\+\d\(\d+\)-\d+-\d+")
\d - matches digit.
+ - one or more occurrences.
\+ - for matching "+"
\( - for matching "("
text = "P: +1(123)-456-7890"
phone.findall(text)
Output :
['+1(123)-456-7890']

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is that the regex is searching, not matching, so it can skip characters until it finds a place where it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexes, Perl provides \G, which matches at the position where the previous match ended. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element: rather than allowing the regex implementation to find a later point where the regex matches, \G would force it to begin matching at the first character of:
'description, with a comma'

You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I changed it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use re.match and store all of the captured texts in capturing groups, that is not possible: the Python regex engine keeps only the last capture of a repeated group, unlike the .NET regex engine, which stores every capture in a CaptureCollection.
Also, Python's re module does not support the \G operator, so you cannot anchor a subpattern at the end of the previous successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern, text, pos=0):
    ptrn = re.compile(pattern)
    match = ptrn.match(text, pos)
    while match:
        yield match.group()
        if match.end() == pos:
            break  # zero-length match: stop to avoid looping forever
        pos = match.end()
        match = ptrn.match(text, pos)
    if pos < len(text):
        yield text[pos:]  # whatever is left after the last match
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*", "21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
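Building on the same idea, the following sketch produces the tab-separated string the question asks for. It relies on the fact that a compiled pattern's match(text, pos) only tries to match at pos, which gives the anchoring that \G provides in Perl. The function name commas_to_tabs and the constant FIELD are hypothetical, and the sketch assumes well-formed input with balanced quotes.

```python
import re

# One field (quoted or unquoted) followed by its separating comma
FIELD = re.compile(r"('[^']*'|[^',]*),\s*")

def commas_to_tabs(text):
    parts = []
    pos = 0
    while True:
        # match() with a pos argument only tries to match at that exact position
        m = FIELD.match(text, pos)
        if not m or m.end() == pos:
            break
        parts.append(m.group(1))
        pos = m.end()
    parts.append(text[pos:])  # the last field has no trailing comma
    return "\t".join(parts)

line = "21, 2, '23.5R25 ETADT', 'description, with a comma'"
print(commas_to_tabs(line))
```

Since the pattern can never restart in the middle of a field, the comma inside the final quoted element is left alone.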

Referencing previous group possible within the same regex?

I am trying to perform a regex in Python. I want to match on a file path that does not have a domain extension and additionally, I only want to get those file paths that have 20 characters max after the last '\' of the file path. For example, given the data:
c:\users\docs\cmd.exe
c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
c:\users\docs\files\target
I would want to match on 'target', and not the other two lines. It should be noted that in my current situation, using the re module or Python operations is not an option, as this regex is fed into the program (which uses re.match()), so I have to do this within a regex string.
I have two regexes:
^([^.]+)$ will match the last 2 lines
([^\\]{,20}$) will match 'cmd.exe' and 'target'
How can I combine these two into one regex? I tried backreferencing (?P=, etc), but couldn't get it to work. Is this even possible?

How about the following? It seems to work for me:
\\([^\\.]{1,20})(?:$|\n)
\\ is escaped literal backslash.
( start of capture group.
[^\\.] match anything except literal backslash or literal dot character
{1,20} match class 1-20 times, as many times as possible (greedy).
) end the capture group.
(?: starts a non-capturing group
$ match the end of the string.
| is the 'or' operator for this group
\n matches a line-feed or newline character (ASCII 10)
) end of non-capturing group
To create this, I used https://regex101.com/#python which is a very good resource in my opinion because it explains every part of the regex and neatly shows the captured groups in real time.
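One caveat is worth checking: regex101 searches for the pattern anywhere in the input, whereas the question states that the program calls re.match(), which is anchored at the start of the string. The sketch below compares the two and shows one possible adjustment, a leading .*, which is an assumption added here and not part of the answer. The variable names are illustrative.

```python
import re

paths = [
    r'c:\users\docs\cmd.exe',
    r'c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf',
    r'c:\users\docs\files\target',
]

core = r'\\([^\\.]{1,20})(?:$|\n)'

def groups(matches):
    # Reduce match objects to their first group, or None when there was no match
    return [m.group(1) if m else None for m in matches]

searched = groups(re.search(core, p) for p in paths)         # how regex101 applies it
matched = groups(re.match(core, p) for p in paths)           # how the program applies it
prefixed = groups(re.match(r'.*' + core, p) for p in paths)  # with a leading .*

print(searched)
print(matched)
print(prefixed)
```

With the leading .*, the pattern behaves under re.match() the same way the bare pattern does under a search for these three paths.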

>>> s = r"""c:\users\docs\cmd.exe
... c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
... c:\users\docs\files\target""".split('\n')
>>> [re.match(r'.*\\([^.]{,20})$', x) for x in s]
[None, None, <_sre.SRE_Match object at 0x7f6ad9631558>]
also
>>> [re.findall(r'.*\\([^.]{,20})$', x) for x in s]
[[], [], ['target']]
This means:
.*\\ - grab everything up to and including the last \
([^.]{,20}) - make sure there is no . in the remaining characters, of which there may be up to 20
$ - end of line
The () around the middle group indicate that it should be the group returned as the match
