Regex for matching German characters in Python

Could someone help me with a regex to match German words/sentences in
Python? It does not work in a Jupyter notebook, although the same pattern
works fine in JSFiddle. I tried the script below, but it does not work:
import re
pattern = re.compile(r'\[^a-zA-Z0-9äöüÄÖÜß]\\', re.UNICODE)
print(pattern.search(text))

Your expression will always fail:
\[^a-zA-Z0-9äöüÄÖÜß]\\
Broken down, you require
\[ # a literal [
^ # start of the line / text
a-z # literally "a-z", etc.
The problem is that you require a literal [ right before the start of a line, which can never be true (before the start of a line there is either nothing or a newline). So, in the end, you could remove the backslashes to get a proper character class, as in:
[^a-zA-Z0-9äöüÄÖÜß]+
But this will surely not match the words you're looking for (quite the opposite, because the ^ inside the class negates it). So either use something as simple as \w+ or the solution proposed by @Wiktor in the comments section.

Square brackets define a set of characters you want to look for; however, a '^' as the first character inside the class negates that set.
If you want to anchor the match at the beginning of the line, you need to put the '^' before the brackets.
You also need to add a quantifier after the class to match more than just one character, in this case:
r'^[a-zA-Z0-9äöüÄÖÜß]+'
This matches one or more of the characters listed in the brackets, as long as they are not separated by any character not listed between '[]'.
See the official documentation of Python's re module for details.
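As a rough illustration of the two answers above, the sketch below compares the explicit (non-negated) character class with \w+. The sample sentence is a made-up assumption, not text from the question.

```python
import re

# Hypothetical sample sentence (not from the original question).
text = "Die Straße ist für Fußgänger geöffnet."

# Explicit class: ASCII letters, digits, and the German umlauts / eszett.
explicit = re.findall(r'[a-zA-Z0-9äöüÄÖÜß]+', text)

# \w+ is Unicode-aware for str patterns in Python 3, so it covers the same letters.
generic = re.findall(r'\w+', text)

print(explicit)
print(generic)
```

Both calls should return the same six words here; \w+ additionally accepts underscores and letters from other alphabets, which may or may not be what you want.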

Related

Regex Match on String (DOI)

Hi, I'm struggling to understand why my regex isn't working.
I have URLs that contain DOIs, like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
And I'm using, for example, this regex, but it always returns an empty list:
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
It looks like you come from another programming language that has the notion of regex literals that are delimited with forward slashes and have the modifiers following the closing slash (hence /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For flags like i you can use the optional flags parameter of findall.
Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 must follow a word break... i.e. it should not be preceded by an alphanumerical character (or underscore).
Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
Finally, alphanumerical characters can be matched with \w, which also matches both lower case and capital Latin letters, so you can shorten the character class a bit and do without any flags such as i (re.I).
This leaves us with:
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
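To see how the corrected pattern behaves on several of the URLs from the question, here is a small sketch. The helper name extract_dois is an assumption made for illustration.

```python
import re

# Corrected pattern from the answer: no /.../i delimiters, no ^ or $, escaped dot.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

def extract_dois(url):
    """Return every DOI-like substring found in url (hypothetical helper)."""
    return DOI_PATTERN.findall(url)

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://dx.doi.org/10.1108/14664100110397304?nols=y',
    'https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3',
]

for url in urls:
    print(extract_dois(url))
```

The query string after ? is dropped because ? is not in the character class, and the openurl link yields an empty list because it carries no DOI.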

Why does my Python regex not work as expected when including a forward slash?

I'm having a Python issue when I include a "not /" class ([^/]) in my regex.
In the following example I only want to find a match if the string sitting in the first word boundary starts with a digit AND there isn't a / at any point afterwards.
Why does the following regex return 1ab as a group value? I was hoping it wouldn't find a match at all:
text = "1ab/"
regex = r"\b(\d[^/]*?)\b"
Whereas:
text = "1abc"
regex = r"\b(\d[^c]*?)\b"
does not return any match, which is the outcome I want for the / scenario.
Any help would be appreciated.
Thanks,
Roy
You can use a negative lookahead assertion:
r'\b(\d\w*?)\b(?!.*/)' (use flags=re.DOTALL with this or prepend (?s) to the regex)
(?!.*/) states that the rest of the input string does not contain a '/' character. If you only want to rule out '/' as the very next character, use the assertion (?!/) instead.
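A minimal sketch of the lookahead approach, using the two test strings from the question. The helper name digit_word_without_slash is an assumption for illustration.

```python
import re

# Lookahead version from the answer; DOTALL lets .* cross line breaks.
PATTERN = re.compile(r'\b(\d\w*?)\b(?!.*/)', flags=re.DOTALL)

def digit_word_without_slash(text):
    """Return the digit-led word if no '/' follows it anywhere, else None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(digit_word_without_slash("1ab/"))   # a slash follows, so no match
print(digit_word_without_slash("1abc"))   # no slash anywhere
```

With re.DOTALL, a slash on a later line also blocks the match; without the flag, .* stops at the first line break and such a slash would go unnoticed.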
You almost had it. However, the slash is not alphanumeric and thus cannot be inside a word, so it makes no sense to match or prohibit it between the start and the end of the word. You have to place the "not slash" sub-expression [^/] after the end of the word, and add a star, [^/]* (which matches a sequence of non-slash characters), to handle the case where a slash occurs later in the string rather than immediately after the end of the first word.
Since you target the first word and the absence of a slash up to the very end of the string, adding start and end anchors helps, especially if you use re.search. This results in:
^[\W]*\b(\d\w*)\b[^/]*\Z
You can play with it in an online debugger such as https://regex101.com/r/uO27vU/2
to better understand the expression or tune it.
Above, ^ is the start of the string, \Z is the end of the string, \W is a "non-word" character, and \w is a "word" character.
You can remove the first \b; I kept it because it may make the expression easier for you to understand.
The second expression that you tried excludes words ending with c, but the first does not. [^c] stands for any character but c, and right after it you have \b, which denotes the end of the word. Together they read: no "c" at the end of the word, please.
Your first expression says: no slashes before the end of the word (a sequence of alphanumeric characters), please. That is the case for your test string.
Always use a debugger to get an explanation of each symbol, and to test and
tune your expressions: regex101.com/r/B6INGg/2
Note that the set of characters in a word can be affected by flags. In Python 2, when the LOCALE and UNICODE flags are not specified, \w matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. In Python 3, str patterns are Unicode-aware by default, and \w is restricted to this ASCII set only when the re.ASCII flag is used.
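The anchored expression from this answer can be tried out with a short sketch like the following. The helper name first_word_if_no_slash and the extra test strings are assumptions for illustration.

```python
import re

# Anchored pattern from the answer: first word starts with a digit,
# and no '/' appears anywhere up to the end of the string.
ANCHORED = re.compile(r'^[\W]*\b(\d\w*)\b[^/]*\Z')

def first_word_if_no_slash(text):
    """Return the first word if it starts with a digit and no '/' follows."""
    m = ANCHORED.search(text)
    return m.group(1) if m else None

for sample in ["1ab/", "1abc", "1ab cd", "1ab cd/ef", "x 1ab"]:
    print(repr(sample), first_word_if_no_slash(sample))
```

Unlike an unanchored search, this version rejects "x 1ab" because the first word there does not start with a digit.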

Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.
EDIT: I forgot to add that the filepath might contain a dot, so I edited "username" to "user.name".
# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""
I've found many answers that work for a single filepath, but they fail when there's more than one, because they also replace the other characters in between.
import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))
>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."
I've also tried a "stop at the first whitespace after the pattern" approach, but I couldn't get it to work:
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches
Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy, you need to add ? after it.
However, just using a lazy {1,255}? quantifier won't solve the problem. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
Hence, use
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
See the regex demo
The (?!\S) negative lookahead will fail any match if, immediately to the right of the current location, there is a non-whitespace char. .{1,255}? will match any 1 to 255 chars, as few as possible.
Use in Python as
re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
The re.MULTILINE (re.M) flag only redefines ^ and $ anchor behavior making them match start/end of lines rather than the whole string. The re.S flag allows . to match any chars, including line break chars.
Please never use (\w|\W){1,255}?; use .{1,255}? with the re.S flag to match any char instead, otherwise performance will suffer.
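Putting the pieces of this answer together, a sketch run against the sample text from the question could look like this. The variable names paths and replaced are assumptions for illustration.

```python
import re

text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

# Lazy quantifier plus a lookahead that requires whitespace or end of string next.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

paths = re.findall(fp_pattern, text, flags=re.S)
replaced = re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)

print(paths)
print(replaced)
```

The dot in user.name does not end the first match early, because "name" is followed by a non-whitespace "/" and the lookahead rejects that position.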
You can use re.findall to find out how many times the regex matches in the string. Hope this helps.
import re
len(re.findall(pattern, string_to_search))

Regex for multiple lines python

I have the following text:
"
In the Matter of
XYZ-ABCD
Respondent.
"
This text is stashed away somewhere in a PDF file. I am only interested in capturing the
XYZ-ABCD part, but apparently the regex I am using in Python is not capturing the pattern correctly.
The piece of text I am interested in capturing can appear anywhere within the PDF and I am using the following pattern:
pat = "^\n+In the Matter of\n+(\s+\w+\s*)\n+
(Respondent\.|Respondents\.)\s+$"
This is the regex code I am using to capture
str = re.match(pat,input_str)
I included the \n to take care of the multiple lines.
However, I am not getting any matches, not even partial ones, and I cannot see what my pattern is missing.
You could use
^\s+In the Matter of\s+(\S+)\s+Respondents?
See a demo on regex101.com (mind the multiline flag).
Some issues with your original expression:
\n != \s # \s includes \n but also other whitespace characters
\w = [a-zA-Z0-9_] # but you wanted to match "-" as well, which is not part of \w
Additionally, you likely had neither the multiline nor the verbose flag on, although your code snippet looks like it would have needed both.
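A minimal sketch of the suggested expression in Python follows. The surrounding text in input_str is invented for illustration. Because re.match only matches at the very start of the string, the sketch uses re.search, as the block can appear anywhere in the document.

```python
import re

# Hypothetical stand-in for the text extracted from the PDF.
input_str = (
    "Some preceding text.\n"
    "\n"
    "In the Matter of\n"
    "XYZ-ABCD\n"
    "Respondent.\n"
    "\n"
    "Some following text.\n"
)

# re.MULTILINE lets ^ match at the start of every line, not only at position 0.
pat = re.compile(r'^\s+In the Matter of\s+(\S+)\s+Respondents?', re.MULTILINE)

match = pat.search(input_str)
name = match.group(1) if match else None
print(name)
```

Note that ^\s+ requires at least one whitespace character (for example a blank line) before "In the Matter of", mirroring the sample in the question.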

regex python - using lookbehinds to find my specific text

UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even need a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests, though, using regex is not recommended for parsing HTML/XML-style documents, but you do you.
EDIT - to accommodate your edit:
Now that I understand your problem better, I can see why you would want to use a lookaround. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing group does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Make sure there is a space after img in the pattern, because of how your text is laid out.
\w is equivalent to [a-zA-Z0-9_] (any letter, digit, or underscore).
This avoids greedily capturing from the first 'img ' string that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you only expect underscores, letters, and digits. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
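As a quick check of the space-delimited pattern above, here is a sketch using the sample string from the question. The variable name filename is an assumption for illustration.

```python
import re

text = '..."img good img two_apple.txt"'

# A space, then a run of non-space characters ending in _apple.txt.
m = re.search(r' ([^ ]+_apple\.txt)', text)
filename = m.group(1) if m else None
print(filename)
```

Because [^ ]+ cannot cross a space, the match cannot start at "good" or at the earlier "img" and run through to the file name.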
