Regex for multiple lines python - python

I have the following text:
"
In the Matter of
XYZ-ABCD
Respondent.
"
It is buried somewhere in a PDF file. I am only interested in capturing the
XYZ-ABCD part, but the regex I am using in Python is not capturing it.
The text I am interested in can appear anywhere within the PDF, and I am using the following pattern:
pat = "^\n+In the Matter of\n+(\s+\w+\s*)\n+
(Respondent\.|Respondents\.)\s+$"
This is the regex code I am using to capture
str = re.match(pat,input_str)
I included the \n to take care of the multiple lines.
However, I am not getting any matches, not even partial ones, and I can't see what my pattern is missing.

You could use
^\s+In the Matter of\s+(\S+)\s+Respondents?
See a demo on regex101.com (mind the multiline flag).
Some issues with your original expression:
\n != \s # \s includes \n but also other whitespace characters
\w = [a-zA-Z0-9_] # but you wanted to match "-" as well, which is not part of \w
Additionally, you likely had neither the multiline nor the verbose flag on, although your code snippet looks like it would have needed them.
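A minimal sketch of how the suggested pattern might be applied, assuming the PDF text has already been extracted into a string. The surrounding lines in input_str are made up for illustration; only the middle block comes from the question. It uses re.search rather than re.match, because re.match only tries the pattern at the very start of the string.

```python
import re

# Hypothetical extracted PDF text; only the middle block is from the question.
input_str = (
    "Some header text\n"
    "\n"
    "In the Matter of\n"
    "XYZ-ABCD\n"
    "Respondent.\n"
    "\n"
    "More text follows.\n"
)

# re.MULTILINE lets ^ match at the start of every line, not only at position 0.
pat = re.compile(r"^\s+In the Matter of\s+(\S+)\s+Respondents?", re.MULTILINE)

# re.search scans the whole string; re.match would only try the beginning.
found = pat.search(input_str)
name = found.group(1) if found else None
print(name)
```

Note that the leading ^\s+ requires at least one whitespace character (for example a blank line) before "In the Matter of".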

Related

regex for matching german characters in python

Could someone help me with a regex to match German words/sentences in
Python? It does not work in a Jupyter notebook, although the same expression
works fine in JSFiddle. I tried the script below, but it does not work:
import re
pattern = re.compile(r'\[^a-zA-Z0-9äöüÄÖÜß]\\', re.UNICODE)
print(pattern.search(text))
Your expression will always fail:
\[^a-zA-Z0-9äöüÄÖÜß]\\
Broken down, you require
[ # literally
^ # start of the line / text
a-z # literally, etc.
The problem is that you require a literal [ right before the start of a line, which can never be true (a line start is preceded by either nothing or a newline). So remove the backslashes to get a proper character class, as in:
[^a-zA-Z0-9äöüÄÖÜß]+
But this will surely not match the words you're looking for (quite the opposite, since ^ inside the class negates it). So either use something as simple as \w+ or the solution proposed by @Wiktor in the comments section.
The square brackets define a range of characters you want to look for, however the '^' negates these characters if it appears within the character class.
If you want to specify the beginning of the line you need to put the '^' before the brackets.
Also you need to add a multiplier behind the class to search for more than just one character in this case:
r'^[a-zA-Z0-9äöüÄÖÜß]+'
One or more of the characters listed in the brackets are matched, as long as they are not separated by any character that is not listed between '[]'.
Here's the link to the official documentation
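A short sketch comparing the original expression with the corrected character class. The sample sentence is an assumption made up for illustration; the patterns are the ones discussed above.

```python
import re

# Hypothetical sample sentence (not from the question).
text = "Die Straße ist schön und grün."

# Original pattern: \[ is a literal "[", so the following ^ is an anchor
# that would have to match after a consumed character, which never happens.
broken = re.compile(r'\[^a-zA-Z0-9äöüÄÖÜß]\\')
broken_result = broken.search(text)
print(broken_result)

# A real character class with a quantifier matches whole words.
words = re.findall(r'[a-zA-Z0-9äöüÄÖÜß]+', text)
print(words)

# In Python 3, \w is Unicode-aware for str patterns and also covers umlauts and ß.
words_w = re.findall(r'\w+', text)
print(words_w)
```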

Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.
EDIT: I forgot to add that the filepath might contain a dot, so I changed "username" to "user.name".
# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""
I've found many answers that work, but they fail when there's more than one filepath, and they also replace the characters in between.
import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))
>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."
I've also tried using a "stop when after finding a whitespace after the pattern" but I couldn't get one to work:
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches
Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy, add ? after it.
However, just using a lazy {1,255}? quantifier won't solve the problem. You also need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed by whitespace or the end of the string.
Hence, use
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
See the regex demo
The (?!\S) negative lookahead will fail any match if, immediately to the right of the current location, there is a non-whitespace char. .{1,255}? will match any 1 to 255 chars, as few as possible.
Use in Python as
re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
The re.MULTILINE (re.M) flag only redefines ^ and $ anchor behavior making them match start/end of lines rather than the whole string. The re.S flag allows . to match any chars, including line break chars.
Avoid (\w|\W){1,255}?; use .{1,255}? with the re.S flag to match any char instead, since the alternation inside a group hurts performance.
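Putting the pieces together, here is a small runnable sketch that applies the pattern to the text from the question. It uses re.subn instead of re.sub only so that the number of replacements can be inspected; that choice is an addition for illustration.

```python
import re

text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

# Lazy .{1,255}? plus a lookahead that requires whitespace or end of string
# right after the 3-4 character extension.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

# re.S lets . match line breaks too; re.subn also returns the replacement count.
result, count = re.subn(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
print(count)
print(result)
```

The dot in "user.name" does not end the match early, because "name" is followed by "/" and the lookahead rejects any non-whitespace character at that position.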
You can use re.findall to find out how many times the regex matches in the string. Hope this helps.
import re
len(re.findall(pattern, string_to_search))

Regex - Why won't this regex work in Python?

I have this expression
:([^"]*) \(([^"]*)\)
and this text
:chkpf_uid ("{4astr-hn389-918ks}")
:"#cert" ("false")
I'm trying to match it so that on the first line I'll get these groups:
chkpf_uid
{4astr-hn389-918ks}
and on the second, I'll get these:
#cert
false
I want to avoid getting the quotes.
I can't understand why the expression I use won't match these, although it does match if I switch the [^"]* to a (.*).
with ([^"]*): won't match
with (.*): does match, but with quotes
This is using the re module in python 2.7
Sidenote: your input may require a specific parser to handle, especially if it may have escape sequences.
Answering the question itself: remember that a regex is processed sequentially from left to right, and so is the string. A match is returned if the pattern matches a portion of the string or the whole string (depending on the method used).
If there are quotation marks in the string and your pattern does not allow for them, the match fails and nothing is returned.
A possible solution is to add the quotes as optional subpatterns:
:"?([^"]*)"? \("?([^"]*)"?\)
 ^^       ^^   ^^       ^^
See the regex demo
The parts you need are captured into groups, and the quotes, present or not, are just matched, left out of your re.findall reach.
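A minimal sketch of the optional-quote pattern run with re.findall on the two sample lines from the question. The variable names are chosen here for illustration.

```python
import re

text = ''':chkpf_uid ("{4astr-hn389-918ks}")
:"#cert" ("false")'''

# "? matches a quote if one is present; the groups themselves exclude quotes.
pattern = re.compile(r':"?([^"]*)"? \("?([^"]*)"?\)')

# findall returns one tuple of captured groups per match.
pairs = pattern.findall(text)
print(pairs)
```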

regex python - using lookbehinds to find my specific text

UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even need a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests, though, using regex to parse HTML/XML-style documents is not recommended, but you do you.
EDIT - to accommodate your edit:
Now that I understand your problem better, I can see why you would want to use a lookaround. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing group does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Note the space right after img; it has to be there because of how your text is laid out.
\w is equivalent to [a-zA-Z0-9_] (any letter, digit, or underscore).
This stops the match from starting at the first 'img ' string that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you only expect letters, digits, and underscores. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
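The anything-except-a-space approach can be checked with a few lines of Python. This is a sketch using the sample string from the question; the variable names are illustrative.

```python
import re

big_text = '..."img good img two_apple.txt"'

# Leading space, then one or more non-space characters ending in _apple.txt.
# [^ ]+ cannot cross a space, so "good img two" is never swallowed.
pattern = re.compile(r' ([^ ]+_apple\.txt)')

found = pattern.search(big_text)
filename = found.group(1) if found else None
print(filename)
```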

how to replace markdown tags into html by python?

I want to replace some "markdown" tags into html tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that. For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens..
#Title1#
##title2##
How could those bbcode/markdown language come into html tags?
Check this regex: demo
Here you can see how I substituted the #...# into <h1>...</h1>.
I believe you can get this to work with double # and so on to cover other markdown features, but you should still listen to the comments from @Thomas and @nhahtdh and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.
As for inline text like **...** to <b>...</b>, you can try this regex with substitution: demo. Hope you can tweak this for other features like underlining and so on.
Your regular expression does not work because in the default mode, ^ and $ match only the beginning and the end of the whole string, respectively.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
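The behaviour described in the quoted documentation can be verified directly. This is a small sketch using the documentation's own foo.$ example.

```python
import re

s = 'foo1\nfoo2\n'

# Default mode: $ matches only at the end of the string (or just before a
# trailing newline), so the last line is the one that matches.
default_hit = re.search(r'foo.$', s).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_hit = re.search(r'foo.$', s, re.MULTILINE).group()

print(default_hit, multiline_hit)
```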
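Add the re.MULTILINE flag in your compile line and drop the \n before $. In MULTILINE mode $ already matches just before each newline, so with \n$ the heading would have to be followed by a blank line or the end of the text:
p = re.compile(r'^#(\w*)#$', re.MULTILINE)
and it should work, at least for single words such as your example. A better check would be
p = re.compile(r'^#([^#]*)#$', re.MULTILINE)
which accepts any sequence that does not contain a #.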
In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
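As a rough sketch of how groups and backreferences fit together, the snippet below converts the three constructs from the question. The function name markdown_to_html and the exact patterns are assumptions made for illustration, not a complete converter; a real markdown parser remains the safer choice.

```python
import re

def markdown_to_html(text):
    """Convert #h1#, ##h2## and **bold** using capture groups (illustrative only)."""
    # Handle ##...## before #...# so the longer marker is not misread.
    text = re.sub(r'^##([^#\n]+)##$', r'<h2>\1</h2>', text, flags=re.MULTILINE)
    text = re.sub(r'^#([^#\n]+)#$', r'<h1>\1</h1>', text, flags=re.MULTILINE)
    # Lazy .+? keeps each bold span as short as possible.
    text = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text)
    return text

source = "#Title1#\n##title2##\nWelcome to **My Home Page**\n"
html = markdown_to_html(source)
print(html)
```

In the replacement string, \1 refers to whatever the first pair of parentheses captured, which is why writing \w* in the replacement (as in the question) cannot work.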
