Pulling out valid twitter names using re module in Python - python

1. Background info
I have a string that contains valid and invalid twitter user names, like this:
#moondra2017.org,#moondra,Python#moondra,#moondra_python
In the above string, #moondra and #moondra_python are valid usernames. The rest are not.
1.1 Goal
By using \b and/or \B as a part of regex pattern, I need to extract the valid usernames.
P.S. I must use \b and/or \B as part of the regex; that is part of the goal.
2. My Failed Attempt
import re
# (in)valid twitter user names
un1 = '#moondra2017.org' # invalid
un2 = '#moondra' # << valid, we want this
un3 = 'Python#moondra' # invalid
un4 = '#moondra_python' # << valid, we want this
string23 = f'{un1},{un2},{un3},{un4}'
pattern = re.compile(r'(?:\B#\w+\b(?:[,])|\B#\w+\b)') # ??
print('10:', re.findall(pattern, string23)) # line 10
2.1 Observed: The above code prints:
10: ['#moondra2017', '#moondra,', '#moondra_python'] # incorrect
2.2 Expected:
10: ['#moondra', '#moondra_python'] # correct

I will answer assuming that the mentions are always in the format as shown above, comma-separated.
Then, to match the end of a mention, you need to use a comma boundary, (?![^,]) or a less efficient but online tester friendly (?=,|$).
pattern = re.compile(r'\B#\w+\b(?![^,])')
pattern = re.compile(r'\B#\w+\b(?=,|$)')
See the regex demo and the Python demo
Details
\B - a non-word boundary, there must be start of string or a non-word char immediately to the left of the current location
# - a # char
\w+ - 1+ word chars (letters, digits or _)
\b - a word boundary (the next char should be a non-word char or end of string)
(?![^,]) - the next char cannot be a char different from , (so it should be , or end of string).
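As a quick check, here is a minimal sketch that runs both patterns against the sample string from the question. The variable names (text, comma_boundary, comma_or_end) are illustrative and not part of the original answer.

```python
import re

# Sample string from the question: only #moondra and #moondra_python are valid.
text = '#moondra2017.org,#moondra,Python#moondra,#moondra_python'

# Variant 1: negative lookahead, "the next char is not something other than a comma".
comma_boundary = re.compile(r'\B#\w+\b(?![^,])')
# Variant 2: positive lookahead, "the next char is a comma, or we are at the end".
comma_or_end = re.compile(r'\B#\w+\b(?=,|$)')

matches_1 = comma_boundary.findall(text)
matches_2 = comma_or_end.findall(text)
print(matches_1)  # ['#moondra', '#moondra_python']
print(matches_2)  # ['#moondra', '#moondra_python']
```

Python#moondra is rejected by the leading \B (there is a word boundary between n and #), and #moondra2017.org is rejected by the lookahead because a . follows the word.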

Related

how to check input string with list of pattern sequentially in python?

I have specific patterns composed of strings, numbers, and special characters in a specific order. I would like to check whether an input string matches one of the patterns I created and print an error for incorrect input. I tried using regex, but my code is not neat enough. Could someone help me with this?
use case
I have the input att2_epic_app_clm1_sub_valid, which I split on _; below are the rules I expect the parts to satisfy, and I want to print an error when they do not match.
Rule:
input should start with att and some number like [att][0-6]*, or [ptt][0-6]; after that it should be continued at either epic or semi, then it should be continued with [app][0-6] or [app][0-6_][clm][0-9_]+[sub|sup]; then it should end with [valid|Invalid]
So I composed this pattern with re, but when I pass invalid input it is not detected, and I expect an error instead.
import re
acceptable_pattern=re.compile(r'([att]+[0-6_])(epic|semi_)([app]+[0-6_]+[clm]+[0-6_])([sub|sup_])([valid|invalid]))'
input='att1_epic_app2_clm1_sub_valid' # this is valid string
wlist=input.split('_')
for each in wlist:
    if any(ext in each for ext in acceptable_pattern):
        print("valid")
    else:
        print("invalid")
This is not quite working, because I have to check the string from beginning to end: after splitting on _, each part must match one of the predefined rules, such as:
the input string should start with att|ptt followed by a number between 1 and 6; the next word is either epic or semi; then comes app, or app1~app6, or app{1_6}clm{1~6}{sub|sup_}; and the string ends with {valid|invalid}.
How should I specify those rules with re.compile to check the input string and raise an error if the parts are not in the expected order? Is there a quick way to do this in Python?
Instead of using split, you could consider writing a pattern that validates the whole string.
If I am reading the requirements correctly, you might use:
^[ap]tt[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
^ Start of string
[ap]tt[0-6] match att or ptt and a digit 0-6
_(?:epic|semi) Match _epic or _semi
_app Match literally
(?: Non capture group for the alternation
[1-6] Match a digit 1-6
| Or
[1-6_]clm[0-9]*_su[bp] Match a digit 1-6 or _, then clm followed by optional digits 0-9 and then _sub or _sup
)? Close the non capture group and make it optional
_valid Match literally
$ End of string
See a regex demo.
If the string can also start with dev then you can use an alternation:
^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
See another regex demo.
Then you can check if there was a match:
import re
pattern = r"^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$"
strings = [
"att2_epic_app_clm1_sub_valid",
"att12_epic_app_clm1_sub_valid",
"att2_epic_app_valid",
"att2_epic_app_clm1_sub_valid"
]
for s in strings:
    m = re.match(pattern, s, re.M)
    if m:
        print("Valid: " + m.group())
    else:
        print("Invalid: " + s)
Output
Valid: att2_epic_app_clm1_sub_valid
Invalid: att12_epic_app_clm1_sub_valid
Valid: att2_epic_app_valid
Valid: att2_epic_app_clm1_sub_valid
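Since the question asks for an error on invalid input, the same pattern could be wrapped in a small helper. This is only a sketch: the function name validate and the choice of ValueError are assumptions, and re.fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Pattern from the answer above, without ^ and $ because fullmatch
# already requires the whole string to match.
PATTERN = re.compile(
    r"(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid"
)

def validate(name):
    """Return name unchanged, or raise ValueError if it does not match."""
    if PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid input: {name!r}")
    return name

for candidate in ["att2_epic_app_clm1_sub_valid", "att12_epic_app_clm1_sub_valid"]:
    try:
        print("Valid:", validate(candidate))
    except ValueError as err:
        print("Error:", err)
```

Raising an exception lets the caller decide how to report the problem, instead of printing inside the check itself.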

Regex should fail if pattern is followed by another pattern

I need to detect #username mentions within a message, but NOT if it is in the form of #username[user_id]. I have a regex that can match the #username part, but am struggling to negate the match if it is followed by \[\d\].
import re
username_regex = re.compile(r'#([\w.#-]+[\w])')
usernames = username_regex.findall("Hello #kevin") # correctly finds kevin
usernames = username_regex.findall("Hello #kevin.") # correctly finds kevin
usernames = username_regex.findall("Hello #kevin[1].") # shouldn't find kevin but does
The regex allows for usernames that contain #, . and -, but need to end with a \w character ([a-zA-Z0-9_]). How can I extend the regex so that it fails if the username is followed by the userid in the [1] form?
I tried #([\w.#-]+[\w])(?!\[\d+\]) but then it matches kevi 🤔
I'm using Python 3.10.
You can "emulate" possessive matching with
#(?=([\w.#-]*\w))\1(?!\[\d+\])
See the regex demo.
Details:
# - a # char
(?=([\w.#-]*\w)) - a positive lookahead that matches and captures into Group 1 zero or more word, ., # and - chars, as many as possible, and then a word char immediately to the right of the current position (the text is not consumed, the regex engine index stays at the same location)
\1 - the text matched and captured in Group 1 (this consumes the text captured with the lookahead pattern, mind that backreferences are atomic by nature)
(?!\[\d+\]) - a negative lookahead that fails the match if there is [ + one or more digits + ] immediately to the right of the current location.
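A minimal sketch running this pattern against the three messages from the question; the names samples and results are illustrative only.

```python
import re

# The lookahead captures the longest candidate username; \1 then consumes it,
# and the engine cannot backtrack into a shorter username such as "kevi".
username_regex = re.compile(r'#(?=([\w.#-]*\w))\1(?!\[\d+\])')

samples = ["Hello #kevin", "Hello #kevin.", "Hello #kevin[1]."]
results = [username_regex.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), '->', found)
# 'Hello #kevin' -> ['kevin']
# 'Hello #kevin.' -> ['kevin']
# 'Hello #kevin[1].' -> []
```

The question targets Python 3.10, where this workaround is needed; from Python 3.11 on, the re module supports possessive quantifiers and atomic groups directly.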

First lookahead then look for closest matching capture group behind the lookahead match. RegEx in Python

I have a full text with line-separated strings. Lines starting with '%' are titles and lines starting with '>' contain the text I want to search for my query in. If my query is found, I want to return the nearest title above it. Here is the expression I tried myself:
import re
query = "ABCDE"
full_text = "%EFGHI\r>XXXXX\r>XXXXX\r%IWANT\r>XXXXX\r>ABCDE"
re.search("%(.*?)\r(?=>.*{})".format(query), full_text).group(1)
I want this code block to return the string:
> 'IWANT'
As this is the closest title above the query. However, it returns:
> 'EFGHI'
I guess it makes sense, since 'EFGHI' is the first element matching the search pattern. Is there a way to first lookahead for my query and then look back for the nearest title?
I suggest matching all parts with \r>... that have no % after \r before the ABCDE value to get the right title:
r"%([^\r]*)(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*{})".format(query)
See the Python demo
Pattern details:
% - a % char
([^\r]*) - Group 1: zero or more chars other than CR chars
(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*ABCDE) - a positive lookahead that, immediately to the right of the current location, must match the following sequence of patterns:
(?:\r(?!%)[^\r]*)* - 0 or more repetitions of CR not followed with % and then followed with zero or more chars other than CR chars
\r> - a CR char and >
[^\r]* - zero or more chars other than CR chars
ABCDE - a literal char sequence
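Put together, the pattern can be used as in the following sketch. The helper name nearest_title is hypothetical, and wrapping the query in re.escape is an addition not present in the original answer; it guards against regex metacharacters in the query.

```python
import re

def nearest_title(query, full_text):
    """Return the closest '%' title above a '>' line containing query, or None."""
    # re.escape is an assumption added here so the query is matched literally.
    pattern = r"%([^\r]*)(?=(?:\r(?!%)[^\r]*)*\r>[^\r]*{})".format(re.escape(query))
    match = re.search(pattern, full_text)
    return match.group(1) if match else None

full_text = "%EFGHI\r>XXXXX\r>XXXXX\r%IWANT\r>XXXXX\r>ABCDE"
title = nearest_title("ABCDE", full_text)
print(title)  # IWANT
```

The first title, EFGHI, is skipped because the lookahead is not allowed to cross the \r% that starts the next title.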

Python regex to extract tokens

I am trying to find all the tokens which look either like abc_rty or abc_45 or abc09_23k or abc09-K34 or 4535. The tokens shouldn't start with _ or - or numbers.
I am not making any progress and have even lost the progress that I did. This is what I have now:
r'(?<!0-9)[(a-zA-Z)+]_(?=a-zA-Z0-9)|(?<!0-9)[(a-zA-Z)+]-(?=a-zA-Z0-9)\w+'
To make the question more clear here is an example:
If I have a string as follows:
D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello
Then it shall accept
D923-44 and 43 and uou and hi_hello
It should ignore
08*) %%5 89ANB -iopu9 _M89 _97N
I might have missed some cases, but I think the text should be enough. Apologies if it's not.
^(\d+|[A-Za-z][\w_-]*)$
Split the line on spaces, then run this regex against each token to filter.
^ is the start of the line
\d means digits [0-9]
+ means one or more
| means OR
[A-Za-z] first character must be a letter
[\w_-]* There can be any alphanumeric, _ or - character after it, or nothing at all.
$ means the end of the line
The flow of the REGEX is shown in the chart I provided, which somewhat explains how it's happening.
In short, it checks whether the token is all digits OR starts with a letter (upper or lower case) followed by any alphanumeric, _ or - characters until the end of the line.
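A minimal sketch of the split-then-filter approach, applied to the example line from the question; the names token_re, line, and accepted are illustrative.

```python
import re

# Either all digits, or a letter followed by word chars and hyphens.
token_re = re.compile(r'^(\d+|[A-Za-z][\w_-]*)$')

line = 'D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello'

# Split on spaces and keep only the tokens that match the whole pattern.
accepted = [tok for tok in line.split(' ') if token_re.match(tok)]
print(accepted)  # ['D923-44', '43', 'uou', 'hi_hello']
```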
This appears to work as desired:
import re
regex = re.compile(r"""
(?<!\S) # Assert there is no non-whitespace before the current character
(?: # Start of non-capturing group:
[^\W\d_] # Match either a letter
[\w-]* # followed by any number of the allowed characters
| # or
\d+ # match a string of digits.
) # End of group
(?!\S) # Assert there is no non-whitespace after the current character""",
re.VERBOSE)
See it on regex101.com.
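For comparison, a sketch of the same verbose pattern used with findall, which scans the whole line without splitting it first. The pattern is re-declared here so the snippet is self-contained; line and tokens are illustrative names.

```python
import re

regex = re.compile(r"""
    (?<!\S)              # no non-whitespace character immediately before
    (?:
        [^\W\d_][\w-]*   # a letter, then any number of allowed characters
      |
        \d+              # or a run of digits
    )
    (?!\S)               # no non-whitespace character immediately after
    """, re.VERBOSE)

line = 'D923-44 43 uou 08*) %%5 89ANB -iopu9 _M89 _97N hi_hello'
tokens = regex.findall(line)
print(tokens)  # ['D923-44', '43', 'uou', 'hi_hello']
```

The two lookarounds play the role that ^ and $ play in the split-based approach: a token only counts if it is delimited by whitespace or the ends of the string.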

Searching an input string for occurences of integers and characters using a single regular expression in Python

I have an input string which is considered valid only if it contains:
At least one character in [a-z]
At least one integer in [0-9], and
At least one character in [A-Z]
There is no constraint on the order of occurrence of any of the above. How can I write a single regular expression that validates my input string?
Try this
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).*$
See it here online on Regexr
The ^ and $ are anchors which bind the pattern to the start and the end of the string.
The (?=...) are lookahead assertions. They check whether the pattern after the = is ahead, but they do not consume it. So to match something there also needs to be a real pattern. Here it is the .* at the end.
The .* would match the empty string also, but as soon as one of the lookaheads fail, the complete expression will fail.
For those who are concerned about the readability and maintainability, use the re.X modifier to allow pretty and commented regexes:
import re
reg = re.compile(r'''
^ # Match the start of the string
(?=.*[a-z]) # Check if there is a lowercase letter in the string
(?=.*[A-Z]) # Check if there is an uppercase letter in the string
(?=.*[0-9]) # Check if there is a digit in the string
.* # Match the string
$ # Match the end of the string
'''
, re.X) # eXtended option: whitespace is not part of the pattern, for better readability
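A minimal sketch that exercises the pattern on a few inputs; the sample strings and the names samples and verdicts are made up for illustration.

```python
import re

# Three independent lookaheads, each scanning from the start of the string.
reg = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).*$')

samples = ['aB3', 'abc123', 'ABC123', 'abcDEF', 'x9Y z']
verdicts = {s: bool(reg.match(s)) for s in samples}
for s, ok in verdicts.items():
    print(repr(s), '->', ok)
# 'aB3' -> True
# 'abc123' -> False
# 'ABC123' -> False
# 'abcDEF' -> False
# 'x9Y z' -> True
```

Because every lookahead starts at the same position, the order of the three character types in the input does not matter.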
Do you need a regular expression?
import string
# t is the string to validate; string.uppercase and string.lowercase exist only in Python 2
if any(c in string.ascii_uppercase for c in t) and any(c in string.ascii_lowercase for c in t) and any(c in string.digits for c in t):
or an improved version of #YuvalAdam's improvement:
if all(any(c in x for c in t) for x in (string.ascii_uppercase, string.ascii_lowercase, string.digits)):
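The regex-free check can be wrapped in a function so it can be called directly. This is a sketch: the name has_required_chars is hypothetical, and it uses string.ascii_uppercase and string.ascii_lowercase, which are available in both Python 2 and Python 3.

```python
import string

def has_required_chars(t):
    """True if t contains an ASCII uppercase letter, a lowercase letter, and a digit."""
    groups = (string.ascii_uppercase, string.ascii_lowercase, string.digits)
    # Every group must contribute at least one character to t.
    return all(any(c in group for c in t) for group in groups)

print(has_required_chars('aB3'))     # True
print(has_required_chars('abc123'))  # False
```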
