RegEx do not match - python

string:
"Btw-nummer: NL855162508B01
NL855162508B02
"
Regex code used:
(^((?!NL855162508B01).))([A-Za-z]{2}\d{9}[A-Za-z]\d{2})
Regex do not match:
NL855162508B01
But do match:
NL855162508B02
As seen in this Regexr I have used:
https://regexr.com/5im28
Desired behavior:
match NL855162508B02
Can you guys help?

You were almost there, but in this part, (?!NL855162508B01)., the trailing . consumes one character (any character except a newline) before the rest of the pattern can match.
You are using 3 capturing groups, which can all be omitted if you need a match only.
To also match the string when it is not directly at the start, you can omit the anchor ^ and use word boundaries \b
\b(?!NL855162508B01\b)[A-Za-z]{2}\d{9}[A-Za-z]\d{2}\b
Regex demo
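As a quick check of the pattern above, here is a minimal sketch that runs it against the string from the question. The variable names are illustrative and not part of the original post.

```python
import re

# Sample input taken from the question: two IDs, one per line.
text = "Btw-nummer: NL855162508B01\nNL855162508B02\n"

# The negative lookahead rejects the excluded ID without consuming a character,
# and the word boundaries keep the match from starting or ending mid-word.
pattern = re.compile(r"\b(?!NL855162508B01\b)[A-Za-z]{2}\d{9}[A-Za-z]\d{2}\b")

matches = pattern.findall(text)
print(matches)
```

Since the pattern has no capturing groups, findall returns the full matches as plain strings.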

Related

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information at pages 2 and after which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means it always starts with a number, a dot, and a country, and finishes with text in brackets, which may continue onto the next line.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can provide some help with this. I was hoping to match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
You could update the pattern using a negated character class that matches up to the first occurrence of :, and then match at least the word on after it.
To match all the following lines, you can match a newline and assert with a negative lookahead that the next line does not consist of only spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of a line (with re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
Regex demo
For example
import re
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
# extracted_text is the text returned by pdfminer's extract_text
print(re.findall(pattern, extracted_text, re.M | re.I))
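To see how the repeated group picks up continuation lines, here is a small self-contained sketch. The sample_text below is invented for illustration and only imitates the layout described in the question; it is not real PDF output.

```python
import re

# Hypothetical stand-in for the text returned by pdfminer's extract_text();
# the wording of these paragraphs is made up for this example.
sample_text = (
    "1. (U) France: On 12 March the ministry announced a plan\n"
    "that continues on a second line. (Source: press release)\n"
    "\n"
    "An unrelated paragraph that should be skipped.\n"
    "\n"
    "2. (U) Germany: On 3 April talks resumed. (Source:\n"
    "wire report)"
)

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# re.M lets ^ match at every line start; re.I lets (u) match (U) and on match On.
paragraphs = re.findall(pattern, sample_text, re.M | re.I)
for paragraph in paragraphs:
    print(repr(paragraph))
```

Each match stops at the first blank line, so a paragraph that wraps onto a second line is returned as one string.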

Regex for parsing uid from URL

I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am seeking some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
Which somehow works, but returns it with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex ([^/d/]+[\d\x])\w+ which outputs the correct string which is iazs9fEil, but also returns the rest of the url, so here it is somethingelse?foo=bar
In short, you may use
import re
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
NOTE
/ is not any special metacharacter, do not escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or d letters (see [/d/]+, a positive character class), then a digit or \x (here, Python raises an error, sre_constants.error: incomplete escape \x; it could conceivably have parsed it as x, but it does not), and then matches 1+ word chars. Because you put /d/ into a character class, it stopped matching a char sequence: [/d/]+ matches slashes and d letters in any order and amount, and places that string into Group 1.
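The difference between a character class and a literal sequence can be checked directly. This is a minimal sketch using the example URL from the question; the variable names are illustrative.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# [/d/]+ is a character class: it is equivalent to [/d]+ and matches any run
# of "/" and "d" characters, in any order.
class_match = re.search(r"[/d/]+", url)
print(class_match.group(0))  # the first such run is the "//" after "https:"

# Outside brackets, /d/ is a literal three-character sequence.
sequence_match = re.search(r"/d/", url)
rest = url[sequence_match.end():]
print(rest)
```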
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes the match is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
Demo
You could use a capturing group:
https?://.*?/d/([^/\s]+)
Regex demo
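For comparison, the three suggestions above can be run side by side on the example URL. This is only a sketch; the variable names are made up here.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# 1) Capturing group after the literal /d/ prefix.
by_group = re.search(r"/d/(\w+)", url).group(1)

# 2) Lookbehind: the prefix is asserted but not part of the match.
by_lookbehind = re.search(r"(?<=/d/)[^/]+", url).group(0)

# 3) Capturing group anchored on the URL scheme.
by_scheme = re.search(r"https?://.*?/d/([^/\s]+)", url).group(1)

print(by_group, by_lookbehind, by_scheme)
```

The variants differ in what they accept as a UID: \w+ stops at the first non-word character, while [^/]+ and [^/\s]+ accept anything up to the next slash.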

With python regex, only those match my pattern but NOT contains '='

I'm new to regex. I'm using regex to match URLs, and some of the results contain =. I want only those URLs that match my pattern but do not contain =.
My pattern is \S+google.com\/\S+-\S+
For example:
MATCH: www.google.com/aa-bb
NOT MATCH: google.com/
NOT MATCH: www.google.com/aa-bb=cc
My current pattern matches the first and third examples, but I want only the first.
With this answer, Regular expression to match a line that doesn't contain a word?, I have tried (?=((?!=).)*)(?=\S+google.com\/\S+-\S+), trying to intersect the results of both matches. But it seems regex does not work this way.
Python Regex answers only, please. Thanks!
Change \S to [^\s=] so it doesn't match spaces or =.
You should also anchor the pattern with ^ and $, so it has to match the entire URL. Otherwise it will match the www.google.com/aa-bb part of www.google.com/aa-bb=cc.
^\S+google\.com\/[^\s=]+-[^\s=]+$
You should also escape literal . in the regexp.
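A short sketch of the anchored pattern, checked against the three examples from the question. The slash is left unescaped since Python does not require escaping it, and the results dictionary is just an illustrative helper.

```python
import re

# Anchored pattern: nothing after the path may be a space or "=".
pattern = re.compile(r"^\S+google\.com/[^\s=]+-[^\s=]+$")

candidates = [
    "www.google.com/aa-bb",     # should match
    "google.com/",              # should not match
    "www.google.com/aa-bb=cc",  # should not match
]

results = {url: bool(pattern.search(url)) for url in candidates}
for url, matched in results.items():
    print(url, matched)
```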

Regex Pattern For Finding String Before First Dot in Python

I need a regex pattern to grab the string before the first dot:
google.com.com
yahoo.com
192.168.1.4
I need a regex that gives google and yahoo, but my pattern grabs IP addresses too. My regex is r'(.*)\.(.*)'
Any advice would be appreciated.
I'm assuming that the text you provided is one string with newlines. In that case, this would be a quick solution to pull the results into a list.
re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)
Here text is the input string. The re.MULTILINE flag simplifies things by allowing the use of '^' to match the beginning of each line.
This results in ['google', 'yahoo'].
For an explanation of the regex, see the verbose version below. (also good for documenting purposes)
re.findall(r'''
    ^             # beginning of each line (multiline mode)
    \s*           # zero or more whitespace characters
    ([a-zA-Z]+)   # captures one or more letters a-z, upper or lower case (just in case)
    \.            # matches '.'
''', text, re.MULTILINE | re.VERBOSE)
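To run both versions end to end, here is a minimal sketch. It assumes, as the answer does, that the input is a single string with one entry per line; the names text, compact and verbose are illustrative.

```python
import re

# Sample input assumed from the question: one host per line.
text = "google.com.com\nyahoo.com\n192.168.1.4"

compact = re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)

verbose = re.findall(r'''
    ^             # beginning of each line (multiline mode)
    \s*           # zero or more whitespace characters
    ([a-zA-Z]+)   # capture one or more ASCII letters
    \.            # a literal dot
''', text, re.MULTILINE | re.VERBOSE)

print(compact)
print(verbose)
```

The IP address is skipped because its first label contains no letters, so ([a-zA-Z]+) cannot match at the start of that line.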
I wasn't sure if the spaces at the beginning of the rows have any meaning, so I took them into account in this regex:
(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])
It matches substrings right after either the beginning of the row, or a space, that are followed by a dot and then a letter (as opposed to a digit).
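A minimal sketch of this lookaround approach, with the dot in the lookahead escaped so that it only matches a literal dot. The sample input is assumed from the question.

```python
import re

# Sample input assumed from the question: one host per line.
text = "google.com.com\nyahoo.com\n192.168.1.4"

# No flags are needed: a newline counts as whitespace, so (?<=\s) covers the
# start of every line after the first, and ^ covers the start of the string.
pattern = re.compile(r"(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])")

names = pattern.findall(text)
print(names)
```

Unlike the capturing-group version, this one returns the matched text directly, because the dot and the following letter are only asserted and never consumed.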

Regex, not statement

Heyho,
I have the regex
([ ;(\{\}),\[\'\"]?)(_[a-zA-Z_\-0-9]*)([ =;\/*\-+\]\"\'\}\{,]?)
to match every occurrence of
_var
Problem is that it also matches strings like
test_var
I tried to add a new matching group negating any word character but it didn't work properly.
Can someone figure out what I have to do to not match strings like var_var?
Thanks for help!
You can use the following "fix":
([[ ;(){},'"]?)(\b_[a-zA-Z_0-9-]*\b)([] =;/*+"'{},-]?)
(the two \b word boundaries around the second group are the additions)
See regex demo
The word boundary \b is an anchor that asserts a position between a word character and a non-word character (or the start/end of the string). That means your _var will never match if preceded by a letter, a digit, or a _. Also, I removed the over-escaping inside the character classes in the optional capturing groups. Note the so-called "smart placement" of hyphens and square brackets, which might not be that important for a Python regex but is still a best practice in writing regexes. Also, in a Python regex you don't need to escape /, since there are no regex delimiters there.
And one more hint: \w matches [a-zA-Z0-9_] (plus other Unicode word characters for str patterns in Python 3, unless re.ASCII is set), so you can write the regex as
([[ ;(){},'"]?)(\b_[\w-]*\b)([] =;/*+"'{},-]?)
See regex demo 2.
And an IDEONE demo (note the use of r'...'):
import re
p = re.compile(r'([[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
test_str = "Some text _var and test_var"
print(re.findall(p, test_str))
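The effect of the word boundaries can also be checked in isolation. The sketch below is a variant of the demo, not the original IDEONE code: the leading [ in the first class is written as \[, an adjustment made here because recent Python versions may warn about a possible nested set when a class starts with [[.

```python
import re

test_str = "Some text _var and test_var"

# Full pattern with the three groups; only the opening bracket is escaped.
full = re.compile(r'([\[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
full_matches = full.findall(test_str)
print(full_matches)

# Core of the fix: \b before the underscore rejects test_var, because there is
# no word boundary between "t" and "_" (both are word characters).
core_matches = re.findall(r"\b_[\w-]*\b", test_str)
print(core_matches)
```

If only the variable names are needed, the core pattern on its own may be enough, since the surrounding optional groups do not affect where \b can match.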
