Regex Pattern For Finding String Before First Dot in Python - python

I need a regex pattern to grab the string before the first dot:
google.com.com
yahoo.com
192.168.1.4
I need a regex that gives google and yahoo, but my pattern grabs IP addresses too. My regex is r'(.*)\.(.*)'
Any advice would be appreciated.

I'm assuming that the text you provided is one string with newlines. In that case, this would be a quick solution to pull the results into a list.
re.findall(r'^\s*([a-zA-Z]+)\.', str, re.MULTILINE)
The re.MULTILINE simplifies things by allowing the use of '^' to match the beginning of each line.
This results in ['google', 'yahoo'].
For an explanation of the regex, see the verbose version below (it is also useful for documentation purposes).
re.findall(r'''
^ # beginning of each line (multiline mode)
\s* # zero or more whitespace characters
([a-zA-Z]+) # captures one or more characters a-z case-insensitive (just in case)
\. # matches '.'
''', str, re.MULTILINE | re.VERBOSE)
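As a quick check, here is a minimal runnable sketch of the approach above. It assumes the three lines from the question arrive as a single newline-separated string; the variable names text and names are illustrative (text is used instead of str to avoid shadowing the built-in).

```python
import re

# Sample input from the question, assumed to be one string with newlines.
text = "google.com.com\nyahoo.com\n192.168.1.4"

# Letters only before the first dot, so the IP address line is skipped.
pattern = re.compile(r'^\s*([a-zA-Z]+)\.', re.MULTILINE)

# findall returns the contents of the single capture group per match.
names = pattern.findall(text)
print(names)
```

Because the character class only allows letters, a host name containing digits or hyphens before the first dot would not be captured by this version.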

I wasn't sure whether the spaces at the beginning of the rows have any meaning, so I took them into account in this regex:
(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])
It matches substrings right after either the beginning of the row, or a space, that are followed by a dot and then a letter (as opposed to a digit).
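A small sketch of how this lookaround pattern behaves, assuming the dot inside the lookahead is escaped so that it only matches a literal dot. The indented line and the hyphenated host my-site.org are hypothetical additions to the question's sample, included to exercise the leading-whitespace and [\w-] parts.

```python
import re

# Question's sample plus two hypothetical variations:
# an indented line and a hyphenated host name.
text = "google.com.com\n  yahoo.com\n192.168.1.4\nmy-site.org"

# Start of string or preceded by whitespace, then word chars/hyphens,
# followed by a literal dot and a letter (not a digit).
pattern = re.compile(r'(?:^|(?<=\s))[\w-]+(?=\.[a-zA-Z])')

# No capture groups, so findall returns the whole matches.
matches = pattern.findall(text)
print(matches)
```

No re.MULTILINE flag is needed here, since a newline counts as whitespace for the lookbehind.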

Related

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information on page 2 and later, which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means each entry always starts with a number, a dot, and a country, and ends with text in brackets, which may continue onto the next line.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can provide some help with this. I was hoping I could match this with a single regex. Otherwise I might try splitting the text by line and iterating through each one.
Thanks in advance.
You could update the pattern using a negated character class that matches until the first occurrence of :, and then match at least the word on after it.
To match all following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of the line (with re.M)
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a dot, a whitespace char, (u) and a whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, and assert not only spaces followed by a newline
)* Close non capture group and optionally repeat
For example
import re
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
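To see the paragraph matching in action without a PDF, here is a minimal sketch on made-up text. The sample string (countries, dates, wording) is entirely hypothetical and only stands in for what pdfminer might return; it assumes paragraphs are separated by a blank line.

```python
import re

# Hypothetical stand-in for the extracted PDF text. The first entry's
# bracketed part wraps onto a second line.
extracted_text = (
    "1. (U) Freedonia: Talks resumed on 12 March in the capital. (source\n"
    "comment continues here)\n"
    "\n"
    "Some unrelated line\n"
    "\n"
    "2. (U) Sylvania: Officials met on 3 April. (short note)"
)

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# re.M makes ^ match at each line start; re.I lets (u) match (U).
paragraphs = re.findall(pattern, extracted_text, re.M | re.I)
for paragraph in paragraphs:
    print(repr(paragraph))
```

Each match stops at the first blank (or whitespace-only) line, so the wrapped bracket text stays attached to its paragraph while unrelated lines are skipped.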

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?
You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
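A short sketch applying the negated-character-class variant to the three paths from the question. The names path_regex, paths and results are illustrative; group 1 is None when the leading directory is absent.

```python
import re

# First group: an optional leading directory with no "/" or "." in it.
# Second group: letters, a dot, letters (e.g. bee.horse).
path_regex = re.compile(r'^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$')

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

# groups() returns (group 1, group 2); an unmatched optional group is None.
results = [path_regex.match(p).groups() for p in paths]
for path, groups in zip(paths, results):
    print(path, '->', groups)
```

Restricting the first group is what stops it from greedily swallowing everything up to the last slash, which is what (.*) did in the original pattern.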

RegEx do not match

string:
"Btw-nummer: NL855162508B01
NL855162508B02
"
Regex code used:
(^((?!NL855162508B01).))([A-Za-z]{2}\d{9}[A-Za-z]\d{2})
Regex do not match:
NL855162508B01
But do match:
NL855162508B02
As seen in this Regexr I have used:
https://regexr.com/5im28
Desired behavior:
match NL855162508B02
Can you guys help?
You were almost there, but this part (?!NL855162508B01). consumes a character after the lookahead: the trailing . matches any character except a newline, so the rest of the pattern starts one character too late.
You are using 3 capturing groups, which can all be omitted if you need a match only.
To also match the string when it is not directly at the start, you can omit the anchor ^ and use word boundaries \b
\b(?!NL855162508B01\b)[A-Za-z]{2}\d{9}[A-Za-z]\d{2}\b
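Here is a minimal sketch that runs the word-boundary pattern against the string from the question. The variable names are illustrative only.

```python
import re

# The input string from the question, including the line break.
text = "Btw-nummer: NL855162508B01\nNL855162508B02\n"

# The negative lookahead rejects the one excluded number; the rest of the
# pattern matches the general shape: 2 letters, 9 digits, 1 letter, 2 digits.
pattern = re.compile(r'\b(?!NL855162508B01\b)[A-Za-z]{2}\d{9}[A-Za-z]\d{2}\b')

vat_numbers = pattern.findall(text)
print(vat_numbers)
```

The lookahead consumes nothing, so the match still begins at the first letter of the number.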

Regex for parsing uid from URL

I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am seeking some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
Which somehow works, but returns it with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex ([^/d/]+[\d\x])\w+ which outputs the correct string which is iazs9fEil, but also returns the rest of the url, so here it is somethingelse?foo=bar
In short, you may use
import re
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
NOTE
/ is not a special metacharacter, so do not escape it in Python string patterns.
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or letters d (see [/d/]+, a positive character class), then a digit or (here, Python shows an error: sre_constants.error incomplete escape \x; it does not parse it as a literal x), and then matches 1+ word chars. You put the /d/ into a character class and it stopped matching a char sequence: [/d/]+ matches slashes and the letter d in any order and amount, and places that text into Group 1.
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
You could use a capturing group:
https?://.*?/d/([^/\s]+)
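The capture-group and lookbehind suggestions can be compared side by side. This is a small sketch using the example URL from the question; the variable names are illustrative, and the capture-group pattern is shortened to the part after the host.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# Option 1: match the literal /d/ and capture what follows up to the next slash.
group_match = re.search(r'/d/([^/\s]+)', url)
uid_from_group = group_match.group(1) if group_match else None

# Option 2: fixed-width lookbehind, so the whole match is already the UID.
lookbehind_match = re.search(r'(?<=/d/)[^/]+', url)
uid_from_lookbehind = lookbehind_match.group(0) if lookbehind_match else None

print(uid_from_group, uid_from_lookbehind)
```

Both options give the same result here; the capture group is arguably easier to read, while the lookbehind avoids having to pick a group out of the match.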

Extracting last word from each line using regex

I would like to extract the last word of each line using regex. Most of the last words are built up like this:
sfdsa AAAAB3NzaCLkc3M
gadsgadsg AAAB3NzaCl/Ezfl
dogjasdpgpds AAAB3Nza/ClBAm+4lj
I already tried:
lastwords = re.findall(r'\s(\w+)$', content, re.MULTILINE)
Try this instead:
\s*([\S]+)$
Explanation:
\s* zero or more whitespace characters
[\S]+ followed by one or more non whitespace characters
$ followed by end of line.
This way, you are guaranteed to match the last occurrence of whitespace characters, as it is followed by no further whitespace characters.
Your regex did not work because \w+ only covers A-Za-z0-9_
So, / doesn't match in two of your examples.
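For comparison, this sketch runs both the original pattern and the suggested one on the three sample lines from the question, assuming they are joined by newlines in a variable named content as in the question's own code.

```python
import re

content = (
    "sfdsa AAAAB3NzaCLkc3M\n"
    "gadsgadsg AAAB3NzaCl/Ezfl\n"
    "dogjasdpgpds AAAB3Nza/ClBAm+4lj"
)

# Original attempt: \w+ cannot cross "/" or "+", so most lines are missed.
original = re.findall(r'\s(\w+)$', content, re.MULTILINE)

# Suggested fix: any run of non-whitespace characters before end of line.
lastwords = re.findall(r'\s*(\S+)$', content, re.MULTILINE)

print(original)
print(lastwords)
```

re.MULTILINE is still required so that $ matches at the end of every line rather than only at the end of the string.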
