Regex breaking with preceding characters - python

I am trying to parse a phone number from a group of strings by compiling this regex:
exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
This successfully matches with a line like "+1(123)-456-7890". However, if I add anything in front of it, like "P: +1(123)-456-7890" it does not match. I tested on Regex websites but can't figure this out at all.

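Consider using re.search, which scans the whole string, instead of re.match, which only matches at the beginning of the string. Alternatively, keep re.match and add a lazy .*? to the start of the pattern (a greedy .* would swallow the leading part of the number).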
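A minimal sketch of the difference, using the pattern from the question unchanged. The names `line`, `anchored`, `scanned`, `greedy` and `lazy` are only illustrative:

```python
import re

exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
line = "P: +1(123)-456-7890"

# re.match only tries the pattern at position 0, where "P: " gets in the way.
anchored = exp.match(line)

# re.search tries every starting position until one matches.
scanned = exp.search(line)

# Prefixing the pattern keeps re.match usable, but greedy and lazy prefixes differ.
greedy = re.match('.*' + exp.pattern, line)
lazy = re.match('.*?' + exp.pattern, line)

print(anchored)          # None
print(scanned.group(0))  # +1(123)-456-7890
print(greedy.group(2))   # 456-7890 (the greedy prefix ate the start of the number)
print(lazy.group(2))     # (123)-456-7890
```

Since re.search needs no change to the pattern, it is probably the simpler of the two options.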

With findall, your regex returns the following result:
[('+1', '(123)-456-7890', '(', ')', '-', '-')]
If the format is fixed, you can use something like:
phone = re.compile(r"\+\d\(\d+\)-\d+-\d+")
\d - matches digit.
+ - one or more occurrences.
\+ - for matching "+"
\( - for matching "("
text = "P: +1(123)-456-7890"
phone.findall(text)
Output :
['+1(123)-456-7890']

Related

Python regular expression truncate string by special character with one leading space

I need to truncate a string at the special characters '-', '(' and '/' when they are preceded by one whitespace character, i.e. ' -', ' (', ' /'.
How can I do that?
patterns=r'[-/()]'
try:
    return row.split(re.findall(patterns, row)[0], 1)[0]
except:
    return row
The above code picked up all special characters, but without the leading space.
patterns=r'[s-/()]'
This one does not work.
Try this pattern
patterns=r'^\s[-/()]'
or remove ^ depending on your needs.
It looks like you want to get a part of the string before the first occurrence of \s[-(/] pattern.
Use
return re.sub(r'\s[-(/].*', '', row)
This code returns the part of the row string before the first occurrence of a whitespace (\s) followed by -, ( or / ([-(/]).
See the regex demo.
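A small sketch of that substitution in use. The helper name `truncate` and the sample rows are made up for illustration:

```python
import re

def truncate(row):
    # Drop everything from the first whitespace that is followed by -, ( or /.
    return re.sub(r'\s[-(/].*', '', row)

print(truncate("well-known item (new)"))  # well-known item
print(truncate("Alpha / Beta"))           # Alpha
print(truncate("no-separator/here"))      # no-separator/here
```

A hyphen or slash that is not preceded by whitespace is left alone, which is why the first and last rows keep theirs.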
Please try this pattern:
patterns = r'\s+-|\s\/|\s\(|\s\)'

Matching regular expression or leaving empty string when not found

This is my first post on this site, so tell me if I mess something up. I need to find config files for files of the same name, the difference being that config file names have 'str' at the end. File names follow this pattern:
some characters + _digit + car + some more characters + str or nothing.
All files are plain text, so the extension doesn't give any more information. The file name also contains some important information, such as the number of the occurrence, which I need to extract as well.
My approach using regex boils down to this
import re
reg = '(.*(?=\\dcar))(\\d(?=car)).*(str)?'
config_to_file1 = 'wts-lg-000191_0car_lp_str'
file1 = 'wts-lg-000191_0car_lp'
print(re.findall(reg,file1))
print(re.findall(reg,config_to_file1))
I also tried this:
reg = '(.*(?=\\dcar))(\\d(?=car)).*(str)+'
I expected to get this:
[('wts-lg-000191_', '0', 'str')]
[('wts-lg-000191_', '0', '')]
But got this instead:
[('wts-lg-000191_', '0', '')]
[('wts-lg-000191_', '0', '')]
I know I'm not using the ? token properly. I tried looking around, but I don't know what I am missing. I also want to stick with the regular expression approach for practice.
The main reason your regex fails is that .* before (str)? grabs the whole string to the end, and (str)? just matches the end of string position since it does not have to consume any chars (as it is optional).
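To see this, here is a quick check with the pattern from the question, written as a raw string; the variable `m` is only illustrative:

```python
import re

reg = r'(.*(?=\dcar))(\d(?=car)).*(str)?'
m = re.search(reg, 'wts-lg-000191_0car_lp_str')

print(m.group(0))  # wts-lg-000191_0car_lp_str (the greedy .* consumed the rest of the name)
print(m.group(3))  # None (the optional group matched nothing)
```

re.findall reports a group that did not take part in the match as an empty string, which is where the '' in the output comes from.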
However, your regex can be greatly optimized as you are overusing lookarounds. Use
reg = r'(.*?)(\d)car(?:.*(str))?'
Or
reg = r'(.*?)(\d+)car(?:.*(str))?'
See this Python demo and the regex demo.
Details
(.*?) - Group 1: any 0+ chars other than line break chars as few as possible
(\d+) - Group 2: one or more digits
car - a car string
(?:.*(str))? - an optional non-capturing group that matches 1 or 0 occurrences of
.* - any 0+ chars other than line break chars as many as possible
(str) - Group 3: str substring.
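A short sketch running the second pattern against the two file names from the question; the `results` dictionary is only there for illustration:

```python
import re

reg = re.compile(r'(.*?)(\d+)car(?:.*(str))?')

results = {}
for name in ('wts-lg-000191_0car_lp', 'wts-lg-000191_0car_lp_str'):
    results[name] = reg.findall(name)
    print(results[name])

# [('wts-lg-000191_', '0', '')]
# [('wts-lg-000191_', '0', 'str')]
```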

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub("\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string s, I want to match 'abc' when it's enclosed in [] and followed by an 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, but only the text within the brackets []. How can I achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one method; one of them is exploiting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (note that indexing starts at 1) so they remain unchanged, and put ABC in place of the 2nd group. The second argument of re.sub is a raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
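A minimal sketch of the lookaround pattern plugged into re.sub, using the string from the question:

```python
import re

s = "[abc]abx[abc]b"

# Only the text between "[" and "]a" is part of the match, so only it is replaced.
out = re.sub(r'(?<=\[)[^]]*(?=\]a)', 'ABC', s)
print(out)  # [ABC]abx[abc]b
```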

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is that the regex is searching, not matching, so it can skip characters until it finds a match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexes, Perl provides \G, which matches at the position where the previous match ended. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element: rather than allowing the regex implementation to find a point where the regex matched, the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I changed it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo).
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
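As a rough sketch, the same pattern can be applied with re.findall to collect every field and join them with tabs. Filtering out empty captures is an assumption made here for brevity, and the names `line`, `pattern` and `fields` are illustrative:

```python
import re

line = "21, 2, '23.5R25 ETADT', 'description, with a comma'"
pattern = re.compile(r",?\s*([^',]*(?:'[^']*'[^',]*)*)")

# The pattern can also match the empty string (for example at the very end),
# so empty captures are filtered out here; this also drops genuinely empty fields.
fields = [f for f in pattern.findall(line) if f]

print(fields)
print("\t".join(fields))
```

The comma inside the last quoted element stays untouched because the quoted string is consumed as a single unit.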
If you want to use re.match and store the captured texts inside capturing groups, that is not possible, since the Python regex engine does not store all the captures in a stack as the .NET regex engine does with CaptureCollection.
Also, Python's re module does not support the \G operator, so you cannot anchor any subpattern at the end of a successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern, text, pos=0):
    ptrn = re.compile(pattern)
    match = ptrn.match(text, pos)
    while match:
        yield match.group()
        # Stop on a zero-length match to avoid looping forever.
        if match.end() == pos:
            break
        pos = match.end()
        match = ptrn.match(text, pos)
    # Yield whatever is left after the last successful match.
    if pos < len(text) - 1:
        yield text[pos:]

for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*", "21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
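Building on the same idea, here is a sketch that emulates \G by calling Pattern.match with an explicit start position and produces the tab-separated line the question asks for. The function name `tabify` is hypothetical:

```python
import re

def tabify(text):
    pattern = re.compile(r"('[^']*'|[^',]*),\s*")
    pos, parts = 0, []
    while True:
        # match() is anchored at pos, so no characters can be skipped.
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break
        parts.append(m.group(1))
        pos = m.end()
    # Whatever could not be matched (the last element) is kept as-is.
    parts.append(text[pos:])
    return "\t".join(parts)

print(tabify("21, 2, '23.5R25 ETADT', 'description, with a comma'"))
```

Because each attempt starts exactly where the previous match ended, the comma inside the final quoted element is never treated as a separator.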

Regexp Word within a word with a fullstop

I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string which contains a forward slash after each word and a tag. An example:
led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION
In those strings, I am only interested in all strings that precede /PERSON. Here's the regexp pattern that I came up with:
(\w)*\/PERSON
And my code:
match = re.findall(r'(\w)*\/PERSON', string)
Basically, I am matching any word that comes before /PERSON. The output:
>>> match
['Timothy', '', 'Geithner']
My problem is that the second match is an empty string: for R./PERSON, the dot is not a word character. I changed my regexp to:
match = re.findall(r'(\w|.*?)\/PERSON', string)
But the match now is:
['led/O by/O Timothy', ' R.', ' Geithner']
It is taking everything prior to the first /PERSON which includes led/O by/O instead of just matching Timothy. Could someone please help me on how to do this matching, while including a full stop as an abbreviation? Or at least, not have an empty string match?
Thanks,
Match everything but a space character ([^ ]*). You also need the star (*) inside the capture:
match = re.findall(r'([^ ]*)\/PERSON', string)
Firstly, (\w|.) matches "a word character, or any character" (dot matches any character which is why you're getting those spaces).
Escaping the dot with a backslash will do the trick: (\w|\.)
Second, as @Ionut Hulub points out, you may want to use + instead of * to ensure you match something. Quantifiers are greedy by default, so the engine will try to match the longest run it can before the slash.
If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.
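A short sketch of both suggestions on the sample sentence. The variable names are illustrative, and the quantifier is placed so that the capture group covers the whole token:

```python
import re

tagged = ("led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O "
          "the/O president/O of/O the/O New/ORGANIZATION")

# \S+ : one or more non-whitespace characters, all inside the capture group.
by_non_space = re.findall(r'(\S+)/PERSON', tagged)

# Word characters or a literal dot; the inner group is non-capturing so that
# findall returns the whole token rather than only its last character.
by_word_or_dot = re.findall(r'((?:\w|\.)+)/PERSON', tagged)

print(by_non_space)    # ['Timothy', 'R.', 'Geithner']
print(by_word_or_dot)  # ['Timothy', 'R.', 'Geithner']
```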
