I am making a bot with an option to not post if the username is not a certain user.
Reddit usernames can contain letters in both cases, and have numbers.
Which regex can be used to identify such a username? The format is /u/USERNAME where username can have letters of both cases and numbers, such as ExaMp13.
I have tried /u/[A-Z][a-z][0-9]
Valid characters for Reddit usernames are preceded by /u/ and include:
UPPERCASE
lowercase
Digits
Underscore
Hyphen
This regex meets those criteria:
/u/[A-Za-z0-9_-]+
Brief
Thanks for updating your post with something you've tried as this gives us an idea of what you may not be understanding (and helps us explain where you went wrong and how to fix it).
Your regex doesn't work because it matches exactly one character from [A-Z], followed by exactly one from [a-z], then exactly one from [0-9]. So your regex will only match something like Be1.
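To see this in practice, here is a minimal sketch in Python (the sample strings are only for illustration). Each bracketed class without a quantifier consumes exactly one character:

```python
import re

# One uppercase letter, then one lowercase letter, then one digit.
attempt = re.compile(r"/u/[A-Z][a-z][0-9]")

short_name = attempt.fullmatch("/u/Be1")      # fits the three-character shape
real_name = attempt.fullmatch("/u/ExaMp13")   # fails: 'a' is not a digit

print(bool(short_name), bool(real_name))
```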
Answer
What you should instead try for is [a-zA-Z0-9] or \w and specifying a quantifier such as + (one or more).
For your specific problem, you should use \/u\/(\w+) (or /u/(\w+), since Python does not require the forward slash to be escaped). This will allow you to then check the first capture group against a list of users you do not want to post for.
These regular expressions match /u/ followed by one or more word characters. \w covers [a-zA-Z0-9_]; in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is set.
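A minimal sketch of that check, assuming Python's re module. The BLOCKED set, the sample comment and the helper name mentioned_users are hypothetical, and the explicit character class is used so that hyphens are accepted as well:

```python
import re

# /u/ followed by one or more letters, digits, underscores or hyphens.
USERNAME = re.compile(r"/u/([A-Za-z0-9_-]+)")

# Hypothetical set of users the bot should not post for.
BLOCKED = {"ExaMp13", "some_other-user"}

def mentioned_users(text):
    """Return every username that appears as /u/NAME in text."""
    return USERNAME.findall(text)

comment = "Thanks /u/ExaMp13 and /u/Be1 for the help"
users = mentioned_users(comment)
should_post = not any(user in BLOCKED for user in users)
print(users, should_post)
```

Because the pattern has a single capture group, findall returns just the names without the /u/ prefix.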
You can use a regex like this:
/u/\w+
Related
I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.
I am currently attempting to create a regex expression to find the end of a website/email link in order to then process the rest of the address. I have decided to look for the ending of the address (eg. '.com', '.org', '.net'); however, I am having difficulty in two areas when dealing with this. (I have chosen this method as it is the best fit for the current project)
Firstly I am trying to get around accidentally hindering users typing words with these keywords within them (eg. '"org"anisation', 'try this "or g"o to'). How I have tackled this is, as an example, the regex:
org(?!\w) - To skip the match if there are letters directly after the keyword.
The secondary problem is finding extra parts of an address (eg. 'www.website."org".uk') which would not be matched. To combat this, as an example, I have used the regex:
org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.
The Main Problem:
In order to prevent both of the above situations I have used the regex akin to:
org(.|dot)\w\w|(?!\w)
However, I am not as well versed in regex as I would like to be, and I understand that this would not produce correct results. I know there is a form of 'if this then that' within regex, but I just can't seem to understand the online documentation I have found on the subject.
If possible would someone be able to explain how I may go about creating a system to say:
IF: NOT org(\w)
ELSE IF: org(.|dot)
THEN: MATCH org(.|dot)\w\w
ELSE: MATCH org
I would really appreciate any help on the matter, this has been on my mind for a while now. I would just like to see it through, but I just do not possess the required knowledge.
Edit:
Test cases the Regex would need to pass (Specifically for the 'org' regex for these examples):
(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >' )
"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email#email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name#email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"
I hope this allows for a better insight to what the Regex needs to do.
The following regex:
(?i)(?<=\.)org(?:\.[a-z]{2})?\b
should do the work for you.
demo:
https://regex101.com/r/8F9qbQ/2/
explanations:
(?i) enables case-insensitive matching (.ORG or .org)
(?<=\.) requires a . immediately before org, to avoid matches when org is actually part of a word.
org to match ORG or org
(?:...)? non capturing group that can appear 0 to 1 time
\.[a-zA-Z]{2} dot followed by exactly 2 letters
\b word boundary constraint
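As a rough check, the pattern can be run against the test cases from the question. This is only a sketch; the sample list is copied from the question and the variable names are arbitrary:

```python
import re

ORG = re.compile(r"(?i)(?<=\.)org(?:\.[a-z]{2})?\b")

samples = [
    "Hello, please come and check out my website: www.website.org",
    "I have just uploaded a new game at games.org.uk",
    "I have just made a new organisation website at website.org, "
    "please get in contact at name.name#email.org.us",
    "For more info check info.org or go to info.org.uk",
]

# No capture groups in the pattern, so findall returns the full matches.
results = [ORG.findall(s) for s in samples]
for found in results:
    print(found)
```

The word "organisation" and the phrase "or go" are skipped because neither has a dot directly before a contiguous org.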
There are some other, simpler ways to catch any website, but assuming that you need exactly the behaviour IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, then you can use:
org(?!\w)(\.\w\w)?
It will match:
"org.uk" of www.domain.org.uk
"org" of www.domain.org
But will not match www.domain.orgzz and orgzz
Explanation:
The org(?!\w) part matches an org that is not followed by a word character. It matches the org in "org" and in "org." but does not match in "orgzz".
Then, once org has matched, (\.\w\w)? tries to match an additional dot plus two word characters. The ? quantifier makes that part optional: it matches .uk when it is present, but the overall match still succeeds without it.
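A short sketch of this pattern in Python; the helper name first_match is made up for the example:

```python
import re

pattern = re.compile(r"org(?!\w)(\.\w\w)?")

def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match("www.domain.org.uk"))
print(first_match("www.domain.org"))
print(first_match("www.domain.orgzz"))
print(first_match("cyborg"))
```

Note that this pattern does not require a dot before org, so a word that merely ends in org, such as "cyborg", still matches. Adding a lookbehind for the dot would be one way to tighten it.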
I made a little regex that captures a website as long as it starts with 'www.' that is followed by some characters with a following '.'.
import re
matcher = re.compile(r'(www\.\S*\.\S*)')  # matches any website with layout www.whatever
string = 'the sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
match = matcher.search(string).group(1)
# output
# 'www.harvard.edu.co'
Now you can tighten this up as needed to avoid false positives.
I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed by a non-whitespace char. Because the lookahead does not consume that char, it stays in the resulting field.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
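A small sketch comparing the two approaches on a made-up sample (the field names and values are invented, not taken from the linked document):

```python
import re

text = (
    "Date: 2020\n\n"
    "URL: http://example.com\n\n"
    "Summary:\n\n"
    "  indented paragraph\n\n"
    "Keywords: a, b"
)

# The negated class consumes the first character of each following field.
consuming = re.split(r"\n\n[^\s]", text)

# The lookahead only checks that character, so it is kept.
lookahead = re.split(r"\n{2,}(?=\S)", text)

print(consuming)
print(lookahead)
```

In both cases the indented paragraph stays attached to its field, because the character after the blank line is a space.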
You want a lookahead. You also might want it to be more flexible about how many newlines there are and which newline characters are used. You might try this:
import re
r = re.compile(r"(?:\r\n|\r|\n)+(?=\S)")
l = r.split(text)
The group has to be non-capturing, (?:...). With a capturing group, re.split also inserts the captured \r\n characters into the list.
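To see how the group type affects re.split, here is a sketch on a made-up string with mixed newline styles:

```python
import re

text = "a\r\n\r\nb\n\n  c\n\nd"

# A capturing group: re.split adds the captured text to the result list.
with_capture = re.split(r"(\r\n|\r|\n)+(?=\S)", text)

# A non-capturing group: only the pieces between separators are returned.
without_capture = re.split(r"(?:\r\n|\r|\n)+(?=\S)", text)

print(with_capture)
print(without_capture)
```

Keep in mind that the + quantifier also lets this pattern split on a single newline, which may or may not be what you want for your data.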
UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even NEED a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML style documents but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Note the space after img in the pattern: it has to be there because of how your text is laid out.
\w - \w is equivalent to [a-zA-Z0-9_] (any letters, numbers or underscore)
This stops the capture from starting at the first 'img ' string that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you are only expecting letters, digits and _. To include anything else, make your own character group with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
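A quick sketch of that pattern in Python, using the sample text from the question:

```python
import re

text = '..."img good img two_apple.txt"'

# Leading space, then a run of non-space characters ending in _apple.txt.
m = re.search(r" ([^ ]+_apple\.txt)", text)
filename = m.group(1) if m else None
print(filename)
```

The earlier spaces (before "good" and before the second "img") fail because the non-space run that follows them does not end in _apple.txt.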
Try it here: https://regex101.com/r/wO7lG3/2
I'm using a snippet i found on stackexchange that finds all url's in a string, using re.findall(). It works perfectly, however to further my knowledge I would like to know how exactly it works. The code is as follows-
re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', site)
As far as i understand, its finding all strings starting with http or https (is that why the [s] is in square brackets?) but I'm not really sure about all the stuff after- the (?:[etc etc etc]))+. I think the stuff in the square brackets eg. [a-zA-Z] is meaning all letters from a to z caps or not, but what about the rest of the stuff? And how is it working to only get the url and not random string at the end of the url?
Thanks in advance :)
Using this link you can get your regex explained:
Your regex explained
To add a bit more:
[s]? means "an optional 's' character", but that is because of the ?, not the brackets (I think they are superfluous).
Space isn't one of the accepted characters, so the match stops there. '/' is different: it is not listed literally, but it falls inside the character range $-_ (ASCII 0x24 to 0x5F, see http://www.asciitable.com/index/asciifull.gif), so slashes in the path are matched.
(?:%[0-9a-fA-F][0-9a-fA-F]) this matches hexadecimal character codes in URLs e.g. %2f for the '/' character.
A non-capturing group means that the group takes part in matching but its text is not stored as a separate group, i.e. you cannot extract just that bit of the match after the regex has been run against your string.
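One way to check which characters end a match is to run the pattern on a made-up string and probe the $-_ range directly. This is only a sketch; the sample text and variable names are invented:

```python
import re

URL = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

site = "docs at https://example.com/a/b?x=1 and http://example.com/~user end"

# All groups are non-capturing, so findall returns the whole matches.
urls = URL.findall(site)
print(urls)

# Probe the range $-_ (0x24 to 0x5F) one character at a time.
in_range = {ch: bool(re.fullmatch(r"[$-_]", ch)) for ch in "/?= ~"}
print(in_range)
```

The second URL is cut off at ~ because that character (0x7E) is not covered by any of the alternatives.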
Hi I'm new to regexes.
I have a string that I want to match any number of A-Z a-z 0-9 - and _
I've tried the following in python however it always matches, even the empty space. Can someone tell me why that is?
re.match(r'[A-Za-z0-9_-]+', 'gfds9 41.-=,434')
Your regex matches one or more of those characters. Your text starts with one or more of those characters, hence it matches. If you want it to only match those characters then you have to match them from the beginning to the end of the text.
re.match(r'^[A-Za-z0-9_-]+$', 'gfds9 41.-=,434')
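A short sketch of the difference, using the string from the question. re.fullmatch is used here as a close alternative to anchoring with ^ and $:

```python
import re

pattern = r"[A-Za-z0-9_-]+"
text = "gfds9 41.-=,434"

# re.match only anchors at the start, so a matching prefix is enough.
prefix = re.match(pattern, text)
print(prefix.group())

# Requiring the whole string to match rejects the space and punctuation.
anchored = re.match(r"^[A-Za-z0-9_-]+$", text)
whole = re.fullmatch(pattern, text)
clean = re.fullmatch(pattern, "gfds9_41-434")
print(anchored, whole, bool(clean))
```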
Try this alternative; maybe it will work for you:
[\w-]+
EDIT:
Although the initial regex you provided also works for me.