Python regex needed for format: 'delete([any text here])' - python

I am a total regex beginner. I want to create a regular expression that strictly allows the word delete followed by a pair of parentheses that contain any kind of characters (for example, http://www.waynesworld1.com).
If I put it all together, it should accept the following: delete(http://www.waynesworld123.com).
Let me emphasize that the regex should strictly accept delete() and shouldn't accept elete(). As long as the user types in delete(), anything is acceptable within the parentheses (for example, delete(12!#Ww) would be fine).
How can I craft this regex in Python? So far all I have is /delete/ for my regex.

Here you go:
^delete\(.*\)$
^ assert position at start of the string
delete matches the characters delete literally (case sensitive)
\( matches the character ( literally
.* matches any character (except newline, unless re.DOTALL is set, as in the test code below)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
$ assert position at end of the string
Here is some Python test code:
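import re

# A list (not a set) keeps the test cases in a predictable order
txt = ["delete(http://www.waynesworld123.com)",
       "delete(12!#Ww)",
       "elete(test)",
       "delete[test]",
       "test"]
# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'^delete\(.*\)$', re.DOTALL)
for line in txt:
    if pattern.search(line):
        print('PASS', line)
    else:
        print('FAIL', line)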
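If the input must be exactly delete(...) with nothing before or after it, a stricter variant is possible. The following is a minimal sketch, assuming Python 3 and a helper name (is_delete_command) made up for this illustration. It uses re.fullmatch, which anchors both ends implicitly. Unlike $, it does not accept a trailing newline after the closing parenthesis.

```python
import re

# fullmatch anchors the pattern at both ends, so ^ and $ are not needed.
# Without re.DOTALL, .* will not cross a newline.
DELETE_RE = re.compile(r"delete\(.*\)")


def is_delete_command(text):
    """Return True if text is exactly delete(<anything on one line>)."""
    return DELETE_RE.fullmatch(text) is not None


for candidate in ["delete(http://www.waynesworld123.com)", "delete()", "elete(test)"]:
    print(candidate, is_delete_command(candidate))
```

Whether an empty delete() or a trailing newline should be accepted depends on the use case, so treat the strictness here as one possible choice.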

Related

Multiline regex in pdf file

I am interested in extracting some information from some PDF files that look like this. I only need the information at pages 2 and after which looks like this:
(U) country: On [date] [text]. (text in brackets)
This means it always starts with a number a dot a country and finishes with brackets which brackets may also go to the next line too.
My implementation in python is the following:
use pdfminer extract_text function to get the whole text.
Then use re.findall function in the whole text using this regex ^\d{1,2}\. \(u\) \w+.\w*.\w*:.* on \d{1,2} \w+.*$ with the re.MULTILINE option too.
I have noticed that this extracts the first line of all the paragraphs that I am interested in, but I cannot find a way to grab everything until the end of the paragraph which is the brackets (.*).
I was wondering if anyone can provide some help into this. I was hoping I can match this by only one regex. Otherwise I might try split it by line and iterate through each one.
Thanks in advance.
You could update the pattern to use a negated character class that matches up to the first occurrence of : and then require at least the word on after it.
To match all following lines, you can match a newline and use a negative lookahead to assert that the next line does not consist only of spaces followed by a newline.
Using a case insensitive match:
^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*
The pattern matches:
^ Start of the string, or of each line when re.MULTILINE is used
\d{1,2}\.\s\(u\)\s Match 1-2 digits, a dot, a whitespace char, (u) and another whitespace char
[^:\n]*: Match any char except : or a newline, then match :
.*?\son\s Match the first occurrence of on between whitespace chars
\d{1,2}\s Match 1-2 digits and a whitespace char
.* Match the rest of the line
(?: Non capture group
\n(?![^\S\r\n]*\n).* Match a newline, assert that the next line is not blank (only spaces followed by a newline), then match that line
)* Close non capture group and optionally repeat
For example
import re

# extracted_text is the string returned by pdfminer's extract_text
pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"
print(re.findall(pattern, extracted_text, re.M | re.I))
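To see the multi-line behaviour without a PDF at hand, here is a small self-contained sketch. The sample text and the country names in it are made up for illustration and only imitate the layout described in the question. It assumes paragraphs are separated by a blank line.

```python
import re

pattern = r"^\d{1,2}\.\s\(u\)\s[^:\n]*:.*?\son\s\d{1,2}\s.*(?:\n(?![^\S\r\n]*\n).*)*"

# Made-up sample standing in for the text returned by pdfminer's extract_text
sample_text = (
    "1. (U) Freedonia: On 12 March the council met\n"
    "and adopted a resolution. (Source A)\n"
    "\n"
    "2. (U) Sylvania: On 3 April talks resumed. (Source B;\n"
    "Source C)"
)

# No capturing groups in the pattern, so findall returns the whole matches
paragraphs = re.findall(pattern, sample_text, re.M | re.I)
for paragraph in paragraphs:
    print(repr(paragraph))
```

Each match runs from the numbered line to the last line before the blank separator, so the closing brackets are included even when they wrap onto the next line.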

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?
You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
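As a quick check, the negated character class version can be run against the three paths from the question. This is only a sketch; the variable names are chosen for illustration.

```python
import re

# First group: an optional leading directory containing no "/" or "."
# Second group: letters, a dot, letters (e.g. bee.horse)
path_regex = re.compile(r"^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$")

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

results = [path_regex.match(p).groups() for p in paths]
for path, groups in zip(paths, results):
    print(path, "->", groups)
```

When the leading directory is missing, the first group is None rather than an empty string, because the optional group does not take part in the match.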

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the third and subsequent characters if they contain any special characters or upper-case characters.
Example:
The pattern works for the string S#yanthan but not for Sa#yanthan. I am expecting that pattern to pass even if the third and subsequent characters contain any special characters or upper-case characters. Can you tell me where this pattern goes wrong? Below is the code snippet.
import re
a = "Sayanthan"
# Raw string so \W and \. reach the regex engine unchanged
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)
Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
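The pattern ^[a-z](?:[a-z.,_-]*[a-z])?\Z can be wrapped in a small helper for testing. This is a minimal sketch, with is_valid being a name invented for this example.

```python
import re

# Lower-case letter first, lower-case letter last, and only
# lower-case letters or . , _ - in between
VALID_RE = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?\Z")


def is_valid(text):
    """Return True if text satisfies all three conditions from the question."""
    return VALID_RE.search(text) is not None


for candidate in ["sayanthan", "Sayanthan", "sa#yanthan", "sa.yan-than", "sayanthan."]:
    print(candidate, is_valid(candidate))
```

The optional group is what lets a single lower-case letter such as "a" pass while still rejecting a trailing special character.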

Python regex specific word with single quote at end

Searching a large syslog repo and need to get a specific word to match with a certain condition.
I'm using regex to compile a search for this word. I've read the python docs on regex characters and I understand how to specify each criteria separately but somehow missing how to concatenate all together for my specific search. This is what I have so far but not working...
p = re.compile("^'[A-Z]\w+'$")
match = re.search(p, syslogline, )
The word is a username that can be alphanumeric, always begins with an uppercase character (preceded by a blank space), can contain letters or digits, is 3-12 characters long and ends with a single quote.
An example would be: Epresley01' or J98473'
Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that it's preceded by a space character (which isn't present in your pattern). Your pattern also checks for ' which isn't the first character of the username. Given that you haven't actually given us a sample of your file, I can't confirm or deny that your string starts before the username and ends after it, but if that's not the case, the anchors ^ and $ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
See regex in use here
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
[^\W_]{2,11} Match any word character except underscore _ between 2 and 11 times (equivalent to [a-zA-Z0-9] for ASCII input; in Python 3, \w also matches non-ASCII letters and digits unless re.ASCII is set)
This appears a little confusing because it's actually a double negative. The negated set [^...] matches anything that's not in the set, and \W matches any non-word character, so [^\W] matches only word characters. Adding _ to the set excludes the underscore as well.
' Match the apostrophe character ' literally
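Here is a minimal Python sketch of the lookbehind pattern in use. The syslog lines are hypothetical, since the real log format was not shown in the question, and re.ASCII is added on the assumption that usernames are ASCII only.

```python
import re

# re.ASCII restricts \w (and so [^\W_]) to a-z, A-Z and 0-9
username_re = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", re.ASCII)

# Hypothetical syslog lines, made up for this example
sample_lines = [
    "sshd[811]: failed login for user Epresley01' from 10.0.0.5",
    "sshd[811]: failed login for user J98473' from 10.0.0.6",
    "sshd[811]: failed login for user Ab' from 10.0.0.7",  # too short
]

found = [username_re.findall(line) for line in sample_lines]
for line, names in zip(sample_lines, found):
    print(names, "<-", line)
```

The match includes the trailing apostrophe. If only the name is wanted, wrapping [A-Z][^\W_]{2,11} in a capturing group is one option.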
I think you can do it like this:
(Updated after the comment from @ctwheels)
See regex in use here
 [A-Z][a-zA-Z0-9]{1,10}'
Explanation
Match a whitespace (the pattern starts with a literal space)
Match an uppercase character [A-Z]
Match [a-zA-Z0-9] between 1 and 10 times
Match an apostrophe '

python regex match and replace escaped characters with in quotes

Given a string say
test = '''my name\t is "zyb\org"''';
I would like to match the "\o" character occurring within the double quotes and replace it with "o". I am struggling to find a suitable way to do it. Please help.
I understand how to match the double quotes using
"\"(.+)\""g
but writing an embedded regexp to identify the escape character is where I am facing issues.
You can go with:
".*(\\.).*"gim
Which does:
".*(\\.).*"gim
" matches the characters " literally
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\\.)
\\ matches the character \ literally
. matches any character (except newline)
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
" matches the characters " literally
g modifier: global. All matches (don't return on first match)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
And here's the LiveDemo
Here's a way to do it:
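import re
# "\\o" spells the backslash explicitly; "\o" is not a valid escape sequence
line = '''my name\t is "zyb\\org"'''
# Keep the surrounding quotes in the replacement and drop only the backslash.
# The first .* is greedy, so only the last backslash inside the quotes is removed.
replaced = re.sub(r'"(.*)\\(.*)"', r'"\1\2"', line)
print(replaced)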
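If a line can contain several quoted strings or several backslashes, one alternative is to use a replacement function. The sketch below first finds each quoted section and then strips the backslashes inside it. The helper name unescape_in_quotes is made up for this example, and it assumes the quoted text never contains an escaped double quote.

```python
import re


def unescape_in_quotes(text):
    """Remove the backslash from every \\x pair found inside double quotes."""

    def strip_backslashes(match):
        # match.group(0) is one quoted section, quotes included
        return re.sub(r"\\(.)", r"\1", match.group(0))

    # "[^"]*" matches one quoted section at a time
    return re.sub(r'"[^"]*"', strip_backslashes, text)


line = 'my name\t is "zyb\\org" and "a\\b\\c"'
result = unescape_in_quotes(line)
print(result)
```

Text outside the quotes, such as the tab after name, is left untouched because the inner substitution only runs on the quoted sections.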
