How to extract specific pattern in Python

I am new to Python.
I want to extract a specific pattern in Python 3.5.
Pattern: digit, character, digit.
The character can be one of * + - / x X.
How can I do this?
I have tried the pattern [0-9\*/+-xX\0-9], but it matches any single one of those characters in a string.
Example: 2*3, 2x3, 2+3 and 2-3 should be matched,
but asdXyz should not.

You may use
[0-9][*+/xX-][0-9]
Or to match a whole string:
^[0-9][*+/xX-][0-9]$
In Python 3.4 and later, you may drop the ^ (start-of-string anchor) and $ (end-of-string anchor) if you use the pattern with re.fullmatch:
import re

if re.fullmatch(r'[0-9][*+/xX-][0-9]', '5+5'):
    print('5+5 string found!')
if re.fullmatch(r'[0-9][*+/xX-][0-9]', '5+56'):
    print('5+56 string found!')
# => 5+5 string found!
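To pull every such expression out of a longer string, rather than validate a whole string, re.findall with the same pattern is one option. This is only a sketch: the helper name extract_expressions and the sample text are made up for illustration.

```python
import re

# Same pattern as above: one digit, one operator, one digit.
# The hyphen sits last in the class so it is read as a literal hyphen.
PATTERN = re.compile(r'[0-9][*+/xX-][0-9]')

def extract_expressions(text):
    """Return every non-overlapping digit-operator-digit substring in text."""
    return PATTERN.findall(text)

print(extract_expressions('2*3 and 2x3, then 2+3 or 2-3'))
print(extract_expressions('asdXyz'))
```

Because the pattern is not anchored here, it also matches inside longer numbers: in '12+34' it finds '2+3'. Add word boundaries or lookarounds if that is not wanted.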

The re.match() function anchors the match at the start of the string only. It does not require the match to reach the end of the string, so use re.fullmatch() or append $ if the whole string must match.
A digit can be matched with \d.
The operator can be matched with [x/+\-], which matches exactly one of x, /, +, or - (the hyphen is escaped because an unescaped hyphen inside a character class denotes a range). Add * and X to the class to cover every operator listed in the question.
The last digit can be matched with \d.
Putting parentheses around each part allows the parts to be extracted as separate subgroups.
For example:
>>> re.match(r'(\d)([x/+\-])(\d)', '3/4').groups()
('3', '/', '4')

Related

How to search for at least one of two groups in Python Regex, when also looking for a third group that is a must?

I'm trying to use regex in order to find misused operators in my program.
Specifically, I'm trying to find whether some operators (say %, $ and #) were used without digits flanking both sides.
Here are some examples of misuse:
'5%'
'%5'
'5%+3'
'5%%'
Is there a way to do that with a single re.search?
I know I can use + for at least one, or * for at least zero,
but looking at:
([^\d]*)(%)([^\d]*)
I would like to find cases where at least one of group(1) and group(3) is non-empty,
since a % with digits on both sides is a valid use of the operator.
I know I could use:
match = re.search(r'[^\d\.]+[#$%]', user_request)
if match:
    return 'Illegal use of match.group()'
match = re.search(r'[#$%][^\d\.]+', user_request)
if match:
    return 'Illegal use of match.group()'
But I would prefer to do so with a single re.search line.
Also, when I use [^\d.], does this include the beginning and the end of the string, or only other characters?
Thank you :)

You might use an alternation with a negative lookahead and a negative lookbehind to assert what is before and what is after is not a digit:
(?<!\d)[#$%]|[#$%](?!\d)
That will match:
(?<!\d) Negative lookbehind to check what is on the left is not a digit
[#$%] Character class, match one of #, $ or %
| Or
[#$%] Character class, match one of #, $ or %
(?!\d) Negative lookahead to check what is on the right is not a digit
For example:
match = re.search(r'(?<!\d)[#$%]|[#$%](?!\d)', user_request)
if match:
    return 'Illegal use of match.group()'
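[^\d.] matches any single character that is not a digit or a literal dot. The ^ inside a character class negates what it contains. A negated class still has to consume one character, so it does not match the empty position at the beginning or end of the string; it does match the first character of the string if that character is not a digit or a dot.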
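As a quick check, the lookaround pattern can be run against the misuse examples from the question plus one valid use. The helper name find_misuse is hypothetical, and the sketch ignores the decimal-point case the question hints at with [^\d\.].

```python
import re

# An operator not preceded by a digit, or an operator not followed by a digit.
MISUSE = re.compile(r'(?<!\d)[#$%]|[#$%](?!\d)')

def find_misuse(expression):
    """Return the first misused operator, or None if none is found."""
    match = MISUSE.search(expression)
    return match.group() if match else None

for sample in ['5%', '%5', '5%+3', '5%%', '5%3']:
    print(sample, '->', find_misuse(sample))
```

Only the last sample, where % has a digit on each side, yields None.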

Regex to check if it is exactly one single word

I am basically trying to match a string pattern (wildcard match).
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, if there are patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word.
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse

def convert_to_regexp(pattern):
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
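The sre_parse module is an internal of re and is deprecated since Python 3.11, so a variant built on the documented re.escape may be preferable. The sketch below is an alternative to the code above, not the answer's own method; the function names wildcard_to_regex and wildcard_match are made up for illustration.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern in which * stands for exactly one word."""
    # Escape each literal piece, then join the pieces with \w+ where a * stood.
    pieces = [re.escape(piece) for piece in pattern.split('*')]
    return re.compile(r'\w+'.join(pieces))

def wildcard_match(pattern, text):
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match('*.key', 'door.key'))
print(wildcard_match('*.key', 'brown.door.key'))
print(wildcard_match('*.key.*', 'brown.key.door'))
print(wildcard_match('*.key.*', 'brown.iron.key.door'))
```

Using fullmatch also rejects trailing text such as 'door.keys', which re.match with an unanchored pattern would accept.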

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
In addition, your samples indicate that you want to match whole strings, which means your pattern should match from the beginning (^) to the end ($) of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.

^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding element.
Now, you want to match just a single word, meaning a string made up entirely of non-whitespace characters. So, the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods) and even an empty word, so using \w+, "one or more 'word characters'", like the other answers is better.

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.

You can use re.match with a capturing group to pick out only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes as few characters as possible (a greedy .* would consume everything except the last digit).
[0-9] and \d are two different ways of matching digits. Note that, for str patterns in Python 3, the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions of the preceding element (here, at least one digit at the end).
$ matches only at the end of the input.
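The difference between \d and [0-9] noted above can be checked directly. A small sketch for Python 3 str patterns; the variable names are illustrative only.

```python
import re

sample_digit = '୪'  # one of the non-ASCII digits mentioned above

# \d on a str pattern is Unicode-aware by default.
unicode_aware = re.fullmatch(r'\d', sample_digit) is not None
# An explicit range only covers the ASCII digits 0-9.
ascii_class = re.fullmatch(r'[0-9]', sample_digit) is not None
# The re.ASCII flag restricts \d to 0-9 as well.
ascii_flag = re.fullmatch(r'\d', sample_digit, re.ASCII) is not None

print(unicode_aware, ascii_class, ascii_flag)
```

If the extracted digits are later passed to code that expects 0-9 only, [0-9] or re.ASCII is the more predictable choice.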

Nice and simple with findall:
import re
s = r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print(re.findall('^.*-([0-9]+)$', s))
# ['767980716']
Regex explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'

Your regex should be (\d+)$.
\d+ matches one or more digits.
$ matches at the end of the string.
So, your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.

Use the regex below:
\d+$
$ denotes the end of the string.
\d is a digit.
+ matches the preceding element one or more times.

Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'

I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re

W = input("Enter a string:")
# Run the match once and reuse the result.
match = re.match('.*?([0-9]+)$', W)
if match is None:
    last_digits = "None"
else:
    last_digits = match.group(1)
print("Last digits of " + W + " are " + last_digits)

Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)

(?:...) is a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or one of the hexadecimal letters (A-F or a-f), in a pair ({2}). It is followed by a non-capturing group whose content is a colon or dash (: or -) followed by another pair of hex digits, with the whole group repeated exactly 5 times.
import re

p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line prints the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant; if they were omitted, m.group(0) would return the same text (it works even with them). If you need to match 7 pairs, change the 5 into a 6. If you need to reject strings with extra pairs, put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
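The effect of anchoring can be seen by comparing match with fullmatch on the same pattern. A minimal sketch, assuming the usual six-pair MAC address format; the helper name is_mac is made up.

```python
import re

# Six pairs of hex digits separated by ':' or '-': one pair, then five more.
MAC = re.compile(r'[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}')

def is_mac(text):
    """True only if the whole string consists of six hex pairs."""
    return MAC.fullmatch(text) is not None

print(is_mac('00:07:32:12:ac:de'))
print(is_mac('00:07:32:12:ac:de:ef'))
# Without anchoring, match() accepts the first six pairs as a prefix.
print(MAC.match('00:07:32:12:ac:de:ef').group(0))
```

fullmatch plays the role of the ^ and $ anchors here, so the seven-pair string from the question is rejected.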

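Using ?: as in (?:...) makes the group non-capturing, so it gets no group number and cannot be referenced in a replacement. When you only test whether the pattern matches, it makes no difference to the result.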
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.

It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
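Putting the capturing and non-capturing forms side by side makes the difference in retrieval visible. A short sketch reusing the 'John Wick' text from this answer; the variable names are illustrative.

```python
import re

text = 'John Wick'
capturing = re.compile(r'John(\sWick)')
non_capturing = re.compile(r'John(?:\sWick)')

# Both patterns match the same text...
print(capturing.search(text).group(0))
print(non_capturing.search(text).group(0))

# ...but only the capturing form exposes the group afterwards.
print(capturing.search(text).groups())
print(non_capturing.search(text).groups())

# findall returns the groups when there are any, else the whole match.
print(capturing.findall(text))
print(non_capturing.findall(text))
```

The findall difference is often the practical reason to choose (?:...): it keeps findall returning whole matches.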

(?:...) means a non-capturing group. The group will not be captured.

Searching an input string for occurrences of integers and characters using a single regular expression in Python

I have an input string which is considered valid only if it contains:
At least one character in [a-z]
At least one integer in [0-9], and
At least one character in [A-Z]
There is no constraint on the order of occurrence of any of the above. How can I write a single regular expression that validates my input string ?

Try this
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).*$
The ^ and $ are anchors which bind the pattern to the start and the end of the string.
The (?=...) are lookahead assertions. They check whether the pattern after the = lies ahead, but they don't consume it. So to match something there also needs to be a real pattern; here it is the .* at the end.
The .* would match the empty string also, but as soon as one of the lookaheads fails, the complete expression fails.
For those who are concerned about the readability and maintainability, use the re.X modifier to allow pretty and commented regexes:
reg = re.compile(r'''
    ^            # Match the start of the string
    (?=.*[a-z])  # Check if there is a lowercase letter in the string
    (?=.*[A-Z])  # Check if there is an uppercase letter in the string
    (?=.*[0-9])  # Check if there is a digit in the string
    .*           # Match the string
    $            # Match the end of the string
''', re.X)  # re.X (extended): whitespace is not part of the pattern, for better readability
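The lookahead pattern can be wrapped in a small validator and tried on a few inputs. This is a sketch only; the function name is_valid and the sample strings are made up.

```python
import re

# One lookahead per requirement; the order of characters in the input is free.
VALID = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).*$')

def is_valid(text):
    """True if text has at least one lowercase letter, one uppercase letter and one digit."""
    return VALID.match(text) is not None

for sample in ['aB3', '3Ba', 'abc3', 'ABC3', 'aBc']:
    print(sample, is_valid(sample))
```

Each lookahead starts from the beginning of the string, which is why the three requirements can be satisfied in any order.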

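Do you need a regular expression?
import string
# string.ascii_uppercase / string.ascii_lowercase replace the Python 2 names string.uppercase / string.lowercase
if any(c in string.ascii_uppercase for c in t) and any(c in string.ascii_lowercase for c in t) and any(c in string.digits for c in t):
    ...  # t is valid
or an improved version of @YuvalAdam's improvement:
if all(any(c in x for c in t) for x in (string.ascii_uppercase, string.ascii_lowercase, string.digits)):
    ...  # t is valid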
