Multiline regex in pdf file - python

I am interested in extracting some information from some PDF files that look like this. I only need the information on page 2 and after, which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot and a country, and finishes with text in brackets, which may continue onto the next line.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can help with this. I was hoping to match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.

You could update the pattern to use a negated character class that matches up to the first occurrence of :, and then match at least the word on after it.
To match all following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of a line (with re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
For example
import re
# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
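To see the multi-line behaviour on its own, here is a minimal sketch that runs the same pattern on a made-up string. The sample text is hypothetical and only stands in for what pdfminer might return; it also assumes that entries are separated by a blank line, which is what the negative lookahead relies on.

```python
import re

# Hypothetical stand-in for the text returned by pdfminer's extract_text().
sample_text = (
    "1. (U) France: On 12 March something happened in the\n"
    "capital. (Source: local\n"
    "press)\n"
    "\n"
    "2. (U) Germany: On 3 April another event. (Source: wire)\n"
    "\n"
    "Unrelated footer line\n"
)

ENTRY = re.compile(
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*",
    re.MULTILINE | re.IGNORECASE,
)

# No capture groups in the pattern, so findall returns the full matches.
entries = ENTRY.findall(sample_text)
for entry in entries:
    print(repr(entry))
```

Each match keeps consuming lines until the next line is blank, so the first entry spans three lines and the footer is not matched. If the real PDF text has no blank line between entries, the lookahead would need a different stop condition.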

Related

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?
You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
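As a quick check, the negated character class version can be run against the three paths from the question. This is only a sketch; the names PATH, paths and results are made up for the example.

```python
import re

# First group may not contain "/" or ".", so it cannot swallow "bee.horse".
PATH = re.compile(r'^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$')

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

# groups() returns None for the optional first group when it did not match.
results = [PATH.match(p).groups() for p in paths]
for path, groups in zip(paths, results):
    print(path, "->", groups)
```

When the leading directory is missing, the first group is None rather than an empty string, so the caller should test for None.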

Extracting last word from each line using regex

I would like to extract the last word of each line using regex. Most of the last words are built up like this:
sfdsa AAAAB3NzaCLkc3M
gadsgadsg AAAB3NzaCl/Ezfl
dogjasdpgpds AAAB3Nza/ClBAm+4lj
I already tried:
lastwords = re.findall(r'\s(\w+)$', content, re.MULTILINE)
Try this:
\s*([\S]+)$
Explanation:
\s* zero or more whitespace characters
[\S]+ followed by one or more non whitespace characters
$ followed by end of line.
That way, you are guaranteed to match the last occurrence of whitespace characters, as it will be followed by no further whitespace characters.
Your regex did not work because \w+ only covers A-Za-z0-9_
So the / (and the +) in two of your examples does not match.
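The difference between the two patterns can be seen by running both on the three sample lines from the question. A minimal sketch, with variable names chosen for the example:

```python
import re

content = (
    "sfdsa AAAAB3NzaCLkc3M\n"
    "gadsgadsg AAAB3NzaCl/Ezfl\n"
    "dogjasdpgpds AAAB3Nza/ClBAm+4lj"
)

# Original attempt: \w+ cannot cross "/" or "+", so only the first line matches.
original = re.findall(r'\s(\w+)$', content, re.MULTILINE)

# Suggested pattern: \S+ accepts any non-whitespace character.
lastwords = re.findall(r'\s*([\S]+)$', content, re.MULTILINE)

print(original)
print(lastwords)
```

If a regex is not required, splitting each line on whitespace and taking the last element, for example `line.split()[-1]` for non-empty lines, gives the same words.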

Regex Pattern For Finding String Before First Dot in Python

I need a regex pattern to grab the string before the first dot:
google.com.com
yahoo.com
192.168.1.4
I need a regex that gives google and yahoo, but my pattern grabs IP addresses too. My regex is r'(.*)\.(.*)'
Any advice would be appreciated.
I'm assuming that the text you provided is one string with newlines. In that case, this would be a quick solution to pull the results into a list.
re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)
Here text is the input string (avoid naming it str, which shadows the built-in type). The re.MULTILINE flag simplifies things by allowing the use of '^' to match the beginning of each line.
This results in ['google', 'yahoo'].
For an explanation of the regex, see the verbose version below (also useful for documentation).
re.findall(r'''
^ # beginning of each line (multiline mode)
\s* # zero or more whitespace characters
([a-zA-Z]+) # captures one or more letters, upper- or lowercase
\. # matches '.'
''', text, re.MULTILINE | re.VERBOSE)
I wasn't sure if the spaces on the beginning of the rows have any meaning, so I took them in account in this regex:
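(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])
It matches substrings that start either at the beginning of the row or right after a whitespace character, and that are followed by a literal dot and then a letter (as opposed to a digit). Note that the dot in the lookahead must be escaped; an unescaped . would match any character.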
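Both approaches can be compared on the sample hosts. This is a sketch only: the input string is made up from the question's examples (with one indented line added), and the lookaround pattern is written with an escaped dot, which is an assumption about what that answer intended.

```python
import re

# Sample input assumed to be one string with newlines; the second line is indented.
text = "google.com.com\n  yahoo.com\n192.168.1.4"

# Anchored capture: optional leading whitespace, letters only, then a literal dot.
anchored = re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)

# Lookaround variant (dot escaped here as an assumption): no capture group,
# so findall returns the matched substrings themselves.
lookaround = re.findall(r'(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])', text, re.MULTILINE)

print(anchored)
print(lookaround)
```

The first pattern rejects the IP address because its first label contains no letters; the second rejects it because the character after the dot is a digit. They would behave differently on a host such as 123abc.com, so which one fits depends on what the real data looks like.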

How to allow regular expression to return empty string

I have a series of text files to parse which may or may not contain any one of a collection of headers, and then lines of data or comment below that header. All header groups are preceded by a double line break.
I am seeking a regular expression that will return an empty string if it sees a header followed immediately by a double line break. I need to differentiate whether a document has that header with no content, or does not have that header at all.
For example, here are portions of two documents:
Dogs
Spaniel
Beagle

Birds
Parrot
and
Dogs

Amphibians
Frogs
Salamanders
I would like a regex that would return Spaniel\nBeagle in the first document, and an empty string for the second.
The closest I have been able to find is (in Python syntax) expr = re.compile("Dogs(.+?|)?\n\n", re.DOTALL). This returns the correct value for the first, but in the second case it returns \n\nAmphibians\nFrogs\nSalamanders. The second question mark and the pipe do not do what I had hoped.
I am handling this by program logic right now, searching for Dogs\n\n and only returning contents if that regex is not found, but it is unsatisfying because nothing beats the feeling of a single regular expression doing the job.
So: is there a regex that will match the second document, and return ""?
Problem
Your Dogs(.+?|)?\n\n pattern matches the word Dogs anywhere in the document, then optionally (because of the empty alternative after |) tries to match 1 or more characters (due to the +? quantifier), as few as possible (since +? is a lazy quantifier), up to the first 2 newlines.
That means the regex matches Dogs with an empty group only if there is no double newline further on in the text; otherwise it grabs all the text up to the next double newline, because .+? must consume at least 1 character (the first newline), so the \n\n part can no longer match the 2 newlines right after Dogs.
Solution
You may use a *? quantifier instead of +? one to allow matching zero or more characters. The Dogs(.*?)\n\n will find Dogs, any 0+ chars as few as possible, up to the first \n\n, even those that appear right after Dogs.
Optimization:
If you process very long strings, and if Dogs appears at the beginning of a line, you may use an unrolled regex since .*? is known to slow regex execution with longer inputs.
Use
expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)
Basically, it will match
^ - start of a line
Dogs - Dogs substring
(.*(?:\n(?!\n).*)*) - Group 1 capturing:
.* - zero or more chars other than linebreak chars (as the re.DOTALL modifier is not used)
(?:\n(?!\n).*)* - zero or more sequences of:
\n(?!\n) - a newline not followed with another newline
.* - zero or more chars other than linebreak chars
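The unrolled pattern can be tried on the two documents from the question plus one without the header. This is a sketch under assumptions: the helper name dogs_section and the third document are hypothetical, and the documents are assumed to separate header groups with a blank line. The captured group starts with the newline that follows the header, so the helper strips it.

```python
import re

DOGS = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)

def dogs_section(document):
    """Return the text under the Dogs header, "" if the header has no
    content, or None if the header is absent (hypothetical helper)."""
    match = DOGS.search(document)
    if match is None:
        return None
    # Group 1 begins with the newline after "Dogs"; drop it.
    return match.group(1).strip("\n")

doc_with_content = "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot"
doc_empty_header = "Dogs\n\nAmphibians\nFrogs\nSalamanders"
doc_no_header = "Cats\nTabby\n\nBirds\nParrot"  # made up for the example

print(repr(dogs_section(doc_with_content)))
print(repr(dogs_section(doc_empty_header)))
print(repr(dogs_section(doc_no_header)))
```

Returning None versus an empty string keeps the distinction the question asks for: a header with no content is different from a missing header.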

match until a certain pattern using regex

I have a string in a text file containing some text as follows:
txt = "java.awt.GridBagLayout.layoutContainer"
I am looking to get everything before the Class Name, "GridBagLayout".
I have tried the following, but I can't figure out how to get rid of the trailing ".":
txt = re.findall(r'java\S?[^A-Z]*', txt)
and I get the following: "java.awt."
instead of what I want: "java.awt"
Any pointers as to how I could fix this?
Without using capture groups, you can use lookahead (the (?= ... ) business).
java\s?[^A-Z]*(?=\.[A-Z]) should capture everything you're after. Here it is broken down:
java //Literal word "java"
\s? //Match for an optional space character. (can change to \s* if there can be multiple)
[^A-Z]* //Any number of non-capital-letter characters
(?=\.[A-Z]) //Look ahead for (but don't add to selection) a literal period and a capital letter.
Make your pattern match a period followed by a capital letter:
'(java\S?[^A-Z]*?)\.[A-Z]'
Everything in capture group one will be what you want.
This seems to do what you want with re.findall(): (java\S?[^A-Z]*)\.[A-Z]
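The suggested patterns can be checked side by side on the string from the question. A minimal sketch; the variable names are chosen for the example.

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# Lookahead version: the ".G" is asserted but not included in the match.
with_lookahead = re.findall(r'java\s?[^A-Z]*(?=\.[A-Z])', txt)

# Capture-group versions: the ".G" is consumed, but only group 1 is returned.
with_lazy_group = re.findall(r'(java\S?[^A-Z]*?)\.[A-Z]', txt)
with_greedy_group = re.findall(r'(java\S?[^A-Z]*)\.[A-Z]', txt)

print(with_lookahead)
print(with_lazy_group)
print(with_greedy_group)
```

All three rely on class names starting with a capital letter and package names containing none, which holds for the sample string but is an assumption about other inputs.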
