match until a certain pattern using regex - python

I have a string in a text file containing some text as follows:
txt = "java.awt.GridBagLayout.layoutContainer"
I am looking to get everything before the Class Name, "GridBagLayout".
I have tried the following, but I can't figure out how to get rid of the trailing ".":
txt = re.findall(r'java\S?[^A-Z]*', txt)
and I get the following: "java.awt."
instead of what I want: "java.awt"
Any pointers as to how I could fix this?

Without using capture groups, you can use lookahead (the (?= ... ) business).
java\s?[^A-Z]*(?=\.[A-Z]) should capture everything you're after. Here it is broken down:
java //Literal word "java"
\s? //Match for an optional space character. (can change to \s* if there can be multiple)
[^A-Z]* //Any number of non-capital-letter characters
(?=\.[A-Z]) //Look ahead for (but don't add to selection) a literal period and a capital letter.
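As a quick check, here is a minimal sketch that runs the lookahead pattern against the string from the question. The variable names are only illustrative.

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# The lookahead (?=\.[A-Z]) requires ".<capital letter>" to follow,
# but does not include it in the match.
match = re.search(r"java\s?[^A-Z]*(?=\.[A-Z])", txt)
package = match.group(0) if match else None
print(package)  # java.awt
```

re.search returns None when nothing matches, so the guard avoids an AttributeError on strings that contain no class name.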

Make your pattern match a period followed by a capital letter:
'(java\S?[^A-Z]*?)\.[A-Z]'
Everything in capture group one will be what you want.

This seems to do what you want with re.findall(): (java\S?[^A-Z]*)\.[A-Z]
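For comparison, a small sketch of the two capture-group patterns suggested above, run against the question's string. When a pattern contains exactly one capture group, re.findall returns the contents of that group rather than the whole match, which is why the trailing ".G" does not show up.

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# Lazy and greedy variants; both stop before the ".<capital letter>".
lazy = re.findall(r"(java\S?[^A-Z]*?)\.[A-Z]", txt)
greedy = re.findall(r"(java\S?[^A-Z]*)\.[A-Z]", txt)
print(lazy, greedy)  # ['java.awt'] ['java.awt']
```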

Related

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information on page 2 and later, which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot, and a country, and ends with text in brackets, which may continue onto the next line.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can provide some help with this. I was hoping I could match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
You could update the pattern by using a negated character class that matches up to the first occurrence of :, and then match on (between whitespace chars) somewhere after it.
To match all following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of a line (the re.M flag is used)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, ., a whitespace char, (u) and a whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, assert that what follows is not a blank line (only spaces followed by a newline), then match the rest of that line
)* Close non capture group and optionally repeat
For example:
import re

# extracted_text is the string returned by pdfminer's extract_text (see the question)
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
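To see the blank-line lookahead at work, here is a small self-contained sketch. The sample text is made up for illustration (the real PDFs are not available here) and assumes that entries are separated by an empty line.

```python
import re

# Hypothetical sample standing in for the text extracted from the PDF.
extracted_text = (
    "1. (U) France: On 12 March something happened\n"
    "and continued here (source A).\n"
    "\n"
    "2. (U) Spain: On 3 April another event (source B)."
)

pattern = (
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*"
    r"(?:\n(?![^\S\r\n]*\n).*)*"  # keep taking lines until a blank line
)

entries = re.findall(pattern, extracted_text, re.M | re.I)
for entry in entries:
    print(repr(entry))
```

The first entry spans two lines, and the match stops at the empty line before entry 2. If the real text does not separate entries with blank lines, the stopping condition would need to be changed.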

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?
You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
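A minimal sketch of the first suggestion, applied to the three paths from the question. When the optional leading directory is missing, the first group is None.

```python
import re

path_regex = re.compile(r"^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$")

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

results = []
for path in paths:
    m = path_regex.match(path)
    results.append(m.groups() if m else None)

print(results)
# [('tree', 'bee.horse'), (None, 'bee.horse'), (None, 'bee.horse')]
```

Excluding "." from the first group is what keeps "bee" from being taken as the leading directory.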

Regex to replace filepaths in a string when there's more than one in Python

I'm having trouble finding a way to match multiple filepaths in a string while maintaining the rest of the string.
EDIT: forgot to add that the filepath might contain a dot, so I edited "username" to "user.name".
# filepath always starts with "file:///" and ends with file extension
text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""
I've found many answers that work for a single filepath, but they fail when there's more than one, also replacing the other characters in between.
import re
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4}"
print(re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.MULTILINE))
>>>"this is an example text extracted from *IGOTREPLACED* running out of text to write."
I've also tried using a "stop when after finding a whitespace after the pattern" but I couldn't get one to work:
fp_pattern = r"file:\/\/\/(\w|\W){1,255}\.[\w]{3,4} ([^\s]+)"
>>> 0 matches
Note that {1,255} is a greedy quantifier and will match as many chars as possible; to make it lazy, you need to add ? after it.
However, just using a lazy {1,255}? quantifier won't solve the problem. You need to define where the match should end. It seems you only want to match these URLs when the extension is immediately followed with whitespace or end of string.
Hence, use
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"
The (?!\S) negative lookahead will fail any match if, immediately to the right of the current location, there is a non-whitespace char. .{1,255}? will match any 1 to 255 chars, as few as possible.
Use in Python as
re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
The re.MULTILINE (re.M) flag only redefines ^ and $ anchor behavior making them match start/end of lines rather than the whole string. The re.S flag allows . to match any chars, including line break chars.
Please never use (\w|\W){1,255}?; use .{1,255}? with the re.S flag to match any char, otherwise performance will suffer.
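Putting the pieces together, here is a short sketch that applies the pattern to the text from the question. The variable name result is only illustrative.

```python
import re

text = """this is an example text extracted from file:///c:/users/user.name/download/temp/anecdote.pdf
1 of 4 page and I also continue with more text from
another path file:///c:/windows/system32/now with space in name/file (1232).html running out of text to write."""

# Lazy .{1,255}? plus (?!\S): stop at the first extension that is
# followed by whitespace or the end of the string.
fp_pattern = r"file:///.{1,255}?\.\w{3,4}(?!\S)"

result = re.sub(fp_pattern, "*IGOTREPLACED*", text, flags=re.S)
print(result)
```

Both paths are replaced separately and the text between them is kept. The ".name" in user.name is not taken as the extension because it is followed by "/", which the (?!\S) lookahead rejects.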
You can try re.findall to find out how many times the regex matches in the string. Hope this helps.
import re
len(re.findall(pattern, string_to_search))

Regex find word including "-"

I have the below regex (from this link: get python dictionary from string containing key value pairs)
r"\b(\w+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)"
Here is the explanation:
\b # Start at a word boundary
(\w+) # Match and capture a single word (1+ alnum characters)
\s*:\s* # Match a colon, optionally surrounded by whitespace
([^:]*) # Match any number of non-colon characters
(?= # Make sure that we stop when the following can be matched:
\s+\w+\s*: # the next dictionary key
| # or
$ # the end of the string
) # End of lookahead
My question is that when my string has a word with a "-" in it, for example movie-night, the above regex does not work, and I think it is due to the \b(\w+). How can I change this regex to work with words that include "-"? I have tried \b(\w+-) but it does not work. Thanks for your help in advance.
You could try something such as this:
r"\b([\w\-]+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)"
Note the [\w\-]+, which allows matching both a word character and a dash.
In the future, you may also want to look into re.X/re.VERBOSE, which lets you spread a regex over several lines with comments and makes it more readable.
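A rough sketch of the same idea written with re.VERBOSE. The sample string is invented for illustration. One assumption goes beyond the pattern above: the lookahead here also uses [\w-]+, so that a hyphenated key in the middle of the string is still recognized as the start of the next key.

```python
import re

pair_re = re.compile(
    r"""
    \b([\w-]+)              # key: word characters and dashes
    \s*:\s*                 # colon, optionally surrounded by whitespace
    ([^:]*)                 # value: any number of non-colon characters
    (?=\s+[\w-]+\s*:|$)     # stop before the next key or at the end
    """,
    re.VERBOSE,
)

# Hypothetical input, not from the original question.
sample = "name: John movie-night: Friday city: Paris"
pairs = dict(pair_re.findall(sample))
print(pairs)
# {'name': 'John', 'movie-night': 'Friday', 'city': 'Paris'}
```

In verbose mode, whitespace and # comments inside the pattern are ignored, so the explanation can live next to each part of the regex.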

regex python - using lookbehinds to find my specific text

UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even NEED a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML style documents but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Note the space after img in the pattern: it has to be there because of how your text is laid out.
\w is equivalent to [a-zA-Z0-9_] (any letter, digit, or underscore; in Python 3 it also matches non-ASCII word characters unless re.ASCII is set).
Because \w cannot match a space, the capture can no longer start at the first 'img ' that pops up and run across spaces, so your capture group is guaranteed not to contain any spaces.
By using \w, I am assuming you only expect letters, digits, and underscores. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
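A minimal check of this last pattern, using a string shaped like the one in the question. The surrounding text is made up.

```python
import re

text = '... img good img two_apple.txt'

# [^ ]+ cannot cross a space, so the match cannot start at an earlier word.
m = re.search(r" ([^ ]+_apple\.txt)", text)
name = m.group(1) if m else None
print(name)  # two_apple.txt
```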
