Extracting last word from each line using regex - python

I would like to extract the last word of each line using regex. Most of the last words are built up like this:
sfdsa AAAAB3NzaCLkc3M
gadsgadsg AAAB3NzaCl/Ezfl
dogjasdpgpds AAAB3Nza/ClBAm+4lj
I already tried:
lastwords = re.findall(r'\s(\w+)$', content, re.MULTILINE)

Try this instead:
\s*([\S]+)$
Explanation:
\s* zero or more whitespace characters
[\S]+ followed by one or more non whitespace characters
$ followed by end of line.
This way, the match starts at the last run of whitespace on the line (if there is one), because the captured characters contain no whitespace themselves.
Your regex did not work because \w+ only covers A-Za-z0-9_.
So / and + do not match, which rules out two of your examples.
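As a quick check, here is a minimal sketch that runs both the original attempt and the suggested pattern against the three sample lines from the question. The variable names original and lastwords are only illustrative.

```python
import re

# Sample lines taken from the question above.
content = (
    "sfdsa AAAAB3NzaCLkc3M\n"
    "gadsgadsg AAAB3NzaCl/Ezfl\n"
    "dogjasdpgpds AAAB3Nza/ClBAm+4lj"
)

# Original attempt: \w+ cannot cross '/' or '+', so only the first line matches.
original = re.findall(r'\s(\w+)$', content, re.MULTILINE)

# Suggested pattern: \S+ accepts any run of non-whitespace characters.
lastwords = re.findall(r'\s*(\S+)$', content, re.MULTILINE)

print(original)   # ['AAAAB3NzaCLkc3M']
print(lastwords)  # ['AAAAB3NzaCLkc3M', 'AAAB3NzaCl/Ezfl', 'AAAB3Nza/ClBAm+4lj']
```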

Related

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information at pages 2 and after which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot, and a country, and finishes with text in brackets, which may also continue onto the next line.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can provide some help into this. I was hoping I can match this by only one regex. Otherwise I might try split it by line and iterate through each one.
Thanks in advance.
You could update the pattern using a negated character class that matches up to the first occurrence of :, and then match at least the word on after it.
To match all the following lines, you can match a newline and use a negative lookahead to assert that the next line is not blank, i.e. does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of the line (with re.MULTILINE)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a ., a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
For example
import re

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
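To see how the paragraph matching behaves, here is a small self-contained sketch. The extracted_text below is made-up sample data that only imitates the layout described in the question (it is not real pdfminer output), and it assumes paragraphs are separated by blank lines.

```python
import re

# Hypothetical sample text; real pdfminer output may differ.
extracted_text = (
    "Header line\n"
    "\n"
    "1. (U) France: On 12 March something happened\n"
    "that continues on the next line. (source A)\n"
    "\n"
    "2. (U) Germany: On 3 April another thing\n"
    "happened. (source B\n"
    "continued)\n"
    "\n"
    "Trailing text\n"
)

pattern = (
    r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*"  # first line of an entry
    r"(?:\n(?![^\S\r\n]*\n).*)*"                        # following non-blank lines
)

# re.M makes ^ match at each line start; re.I lets (u) match (U) and on match On.
matches = re.findall(pattern, extracted_text, re.M | re.I)

for m in matches:
    print(repr(m))
```

Each match stops at the newline that precedes a blank line, so the header and the trailing text are not picked up.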

Ignoring a word in regex (negative lookahead)

I'm looking to try and ignore a word in regex, but the solutions I've seen here did not work correctly for me.
Regular expression to match a line that doesn't contain a word
The issue I'm facing is I have an existing regex:
(?P<MovieCode>[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
That is matching on Deku-041114-575-boku.mp4.
However, I want this regex to fail to match for cases where the MovieCode group has Deku in it.
I tried
(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
but unfortunately it just matches eku-124 and I need it to fail.
I have a regex101 with my attempts.
https://regex101.com/r/xqALM2/2
The MovieCode group can match 3-6 chars A-Za-z, and Deku has 4 chars. If that part should not contain Deku, you could use a negative lookahead in which Deku is preceded by the character class [A-Za-z] repeated 0+ times, as [A-Za-z]* cannot cross the -.
To prevent matching eku-124, you could prepend a word boundary before the MovieCode group, or add (?<!\S) if there should be a whitespace boundary on the left.
Note that you can omit {1} from the pattern.
\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?
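A minimal sketch of the pattern in Python follows. The second file name, ABCD-1234A.mp4, is a hypothetical example added only to show a successful match.

```python
import re

pattern = re.compile(
    r'\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})'
    r'(?P<MoviePart>[A-C]\b)?'
)

# The file name from the question: the lookahead rejects Deku, and the
# word boundary stops the engine from retrying at eku-...
rejected = pattern.search("Deku-041114-575-boku.mp4")

# Hypothetical file name that does not contain Deku.
accepted = pattern.search("ABCD-1234A.mp4")

print(rejected)                     # None
print(accepted.group("MovieCode"))  # ABCD-1234
print(accepted.group("MoviePart"))  # A
```

Note that the lookahead is case sensitive as written, so a lowercase deku would still be accepted unless re.IGNORECASE is added.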

Regex Pattern For Finding String Before First Dot in Python

I need a regex pattern to grab the string before the first dot:
google.com.com
yahoo.com
192.168.1.4
I need a regex that gives google and yahoo, but my pattern grabs IP addresses too. My regex is r'(.*)\.(.*)'
Any advice would be appreciated.
I'm assuming that the text you provided is one string with newlines. In that case, this would be a quick solution to pull the results into a list.
re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)
The re.MULTILINE simplifies things by allowing the use of '^' to match the beginning of each line.
This results in ['google', 'yahoo'].
For an explanation of the regex, see the verbose version below. (also good for documenting purposes)
re.findall(r'''
^ # beginning of each line (multiline mode)
\s* # zero or more whitespace characters
([a-zA-Z]+) # captures one or more characters a-z case-insensitive (just in case)
\. # matches '.'
''', text, re.MULTILINE | re.VERBOSE)
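I wasn't sure if the spaces at the beginning of the rows have any meaning, so I took them into account in this regex (use it with re.MULTILINE so that ^ matches at each row):
(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])
It matches substrings right after either the beginning of the row or a whitespace character, that are followed by a literal dot and then a letter (as opposed to a digit). The dot must be escaped as \. because an unescaped . would match any character.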
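Both approaches can be compared on the sample input with a short sketch like the one below. It assumes the three hosts arrive as one newline-separated string, and it uses the lookahead pattern with the dot escaped as \. so that only a literal dot qualifies.

```python
import re

text = "google.com.com\nyahoo.com\n192.168.1.4"

# Capture letters at the start of each line that are followed by a dot.
by_capture = re.findall(r'^\s*([a-zA-Z]+)\.', text, re.MULTILINE)

# Lookaround variant: a name after line start or whitespace, followed by
# a literal dot and a letter, so purely numeric parts are skipped.
by_lookahead = re.findall(
    r'(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])', text, re.MULTILINE
)

print(by_capture)    # ['google', 'yahoo']
print(by_lookahead)  # ['google', 'yahoo']
```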

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, -\w+ could occur many times, depending on the number of words in a term. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, and [^\W] negates that class, so it matches a single word character (the same as \w).
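A rough way to test the first suggestion is sketched below. It swaps in a non-capturing group and uses re.fullmatch so that the whole term has to fit the pattern; both are illustrative choices rather than part of the answer above.

```python
import re

# One word, followed by zero or more "-word" parts.
pattern = re.compile(r'\w+(?:-\w+)*')

candidates = ["word", "word-hyphen", "word-hyphen-again", "word-", "a-+-+---"]

# fullmatch only succeeds if the entire string matches the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for term, ok in results.items():
    print(f"{term!r}: {ok}")
```

The three terms from the question match, while a trailing hyphen or a run of punctuation does not.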
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $

Python Regex for hyphenated words

I'm looking for a regex to match hyphenated words in Python.
The closest I've managed to get is: '\w+-\w+[-\w+]*'
text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*',text)
which returns list ['one-hundered-and-three-', 'foo-bar'].
This is almost perfect except for the trailing hyphen after 'three'. I only want the additional hyphen if it is followed by a 'word', i.e. instead of the '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I need something that matches |word followed by hyphen followed by word, followed by hyphen_word zero or more times|.
Try this:
re.findall(r'\w+(?:-\w+)+',text)
Here we consider a hyphenated word to be:
a number of word chars
followed by one or more of:
a single hyphen
followed by word chars
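The sketch below runs the suggested pattern on the text from the question, next to the capturing-group attempt. It illustrates why the attempt returned odd results: when a pattern contains a capturing group, re.findall returns the group contents instead of the whole match.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# Capturing group: findall returns only the last value captured by the group.
with_capture = re.findall(r'\w+-\w+(-\w+)*', text)

# Non-capturing group: findall returns the whole match.
hyphenated = re.findall(r'\w+(?:-\w+)+', text)

print(with_capture)  # ['-three', '']
print(hyphenated)    # ['one-hundered-and-three', 'foo-bar']
```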
