I want to check whether a string contains a continuous sequence of exclamation marks, question marks, or both.
By continuous, I mean more than 2 times, like below:
#If sentence contains !!!
exc = re.compile(r"(.)\!{2}")
word["cont_exclamation"] = found if exc.search(sent[i]) else not(found)
#If sentence contains ???
reg = re.compile(r"(.)\?{2}")
word["cont_question"] = found if reg.search(sent[i]) else not(found)
But now I want to find exclamation and question marks mixed together, for example hello??! or hey!! or dude!?!
Also, what if I want both ? and ! but with more than 2 of either of them?
I don't know regex well, so any help would be great.
Use the regex '[?!]{3,}', which matches the ? or ! characters 3 or more times in a row (if continuous means more than two times). Escaping is not needed inside a character class.
Add more punctuation characters to the character class as needed.
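A minimal sketch of how that character class could be used in Python. The helper name has_punct_run and the sample strings are only illustrative assumptions, not part of the original code:

```python
import re

# Matches a run of three or more characters drawn from '?' and '!'.
PUNCT_RUN = re.compile(r"[?!]{3,}")

def has_punct_run(sentence):
    """Return True if the sentence has 3+ consecutive '?' or '!' characters."""
    return PUNCT_RUN.search(sentence) is not None

for text in ["hello??!", "dude!?!", "hey!!", "what?"]:
    print(text, has_punct_run(text))
```

Change {3,} to {2,} if a run of two, such as hey!!, should also count.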
try re.compile(r"(.)[\?\!]{2}")
regex = re.compile(r"(.)(\?|\!){2}")
edit: Typing "regex tutorial" into google gives more info than you possibly need. This tutorial looks particularly well-balanced between conciseness and completeness.
Particularly (i.m.o.) useful tricks that are often not mentioned:
use +? and *? to switch from a greedy to a lazy match, i.e. match as few characters as possible instead of as many as possible. Example text: #ab# #de# --> #.*?# matches #ab# only (not #ab# #de#)
parentheses create a capture group by default. If you don't want this, you can use (?:...).
Most importantly, comment each regexp with a human-readable explanation. Future-you will be grateful. :-)
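Both tricks can be tried out in a few lines of Python. This is just a sketch; the variable names and the second sample string ("ababc") are made up for illustration:

```python
import re

text = "#ab# #de#"

# Greedy: .* runs on to the last '#'.
greedy = re.search(r"#.*#", text).group()
# Lazy: .*? stops at the first '#' it can.
lazy = re.search(r"#.*?#", text).group()

# With a capturing group, findall returns the group's content.
captured = re.findall(r"(ab)+c", "ababc")
# With a non-capturing group, findall returns the whole match.
non_captured = re.findall(r"(?:ab)+c", "ababc")

print(greedy, lazy, captured, non_captured)
```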
I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that, because of the parentheses in (941), the regex treats 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but first I need help finding the 2nd (941) so that I can then apply the 3rd step to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to do this using regex. Thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try escaping the variable and using it in the regular expression through an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
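To see why escaping matters here, a small sketch comparing the raw variable with its escaped form (the names unescaped and escaped are only illustrative):

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses form a capture group around the digits,
# so findall returns the group content without the parentheses.
unescaped = re.findall(num, s)

# re.escape turns '(' and ')' into literals, so the full text is matched.
escaped = re.findall(re.escape(num), s)

print(unescaped)
print(escaped)
print(re.escape(num))
```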
I am looking for a pattern that matches everything until the first occurrence of a specific character, say a ";" - a semicolon.
I wrote this:
/^(.*);/
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.
You need
/^[^;]*/
[^;] is a negated character class; it matches any character except a semicolon.
^ (start of line anchor) is added to the beginning of the regex so only the first match on each line is captured. This may or may not be required, depending on whether possible subsequent matches are desired.
To cite the perlre manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
This should work in most regex dialects.
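A quick check in Python, as a sketch. The sample sentence is borrowed from another answer in this thread, and the variable names are made up:

```python
import re

text = "this is a test sentence; to prove this regex; that is g;iven below"

# Greedy .* backtracks only as far as the LAST semicolon.
greedy = re.match(r"^(.*);", text).group(1)

# The negated class cannot cross a semicolon, so it stops at the first one.
negated = re.match(r"^[^;]*", text).group()

# The lazy quantifier reaches the same place by expanding as little as possible.
lazy = re.match(r"^(.*?);", text).group(1)

print(repr(greedy))
print(repr(negated))
print(repr(lazy))
```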
Would
/^(.*?);/
work?
The ? is a lazy operator, so the regex grabs as little as possible before matching the ;.
/^[^;]*/
The [^;] says match anything except a semicolon. The square brackets are a set-matching operator: essentially, match any character in this set of characters. The ^ at the start makes it an inverse match, so it matches anything not in this set.
None of the proposed answers worked for me (e.g. in Notepad++).
But
^.*?(?=\;)
did.
Try /[^;]*/
Google regex character classes for details.
sample text:
"this is a test sentence; to prove this regex; that is g;iven below"
Given the sample text above, the regex /(.*?\;)/ will give you everything up to the first occurrence of a semicolon (;), including the semicolon: "this is a test sentence;"
Try /[^;]*/
That's a negated character class.
This was very helpful for me as I was trying to figure out how to match all the characters in an XML tag, including attributes. I was running into the "matches everything to the end" problem with:
/<simpleChoice.*>/
but was able to resolve the issue with:
/<simpleChoice[^>]*>/
after reading this post. Thanks all.
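A small Python sketch of the difference. The sample markup, including the identifier attribute, is an assumption made up for this example:

```python
import re

# Hypothetical input; only the tag name comes from the post above.
xml = '<simpleChoice identifier="A">Yes</simpleChoice>'

# Greedy .* runs to the last '>' on the line, swallowing the closing tag too.
greedy = re.search(r"<simpleChoice.*>", xml).group()

# [^>]* cannot cross a '>', so the match ends with the opening tag.
opening_tag = re.search(r"<simpleChoice[^>]*>", xml).group()

print(greedy)
print(opening_tag)
```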
This is not a regex solution, but it is simple enough for your problem description. Just split your string and take the first item from the resulting array.
$str = "match everything until first ; blah ; blah end ";
$s = explode(";",$str,2);
print $s[0];
output
$ php test.php
match everything until first
This will match up to the first occurrence only in each string and will ignore subsequent occurrences.
/^([^;]*);*/
"/^([^\/]*)\/$/" worked for me, to get only top "folders" from an array like:
a/ <- this
a/b/
c/ <- this
c/d/
/d/e/
f/ <- this
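The same pattern can be checked in Python, as a sketch (without the surrounding / delimiters, the slash needs no escaping; the names paths and top_level are illustrative):

```python
import re

paths = ["a/", "a/b/", "c/", "c/d/", "/d/e/", "f/"]

# A run of non-slash characters followed by exactly one trailing slash.
top_folder = re.compile(r"^([^/]*)/$")

top_level = []
for path in paths:
    match = top_folder.match(path)
    if match:
        top_level.append(match.group(1))

print(top_level)
```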
Really kinda sad that no one has given you the correct answer....
In regex, a ? placed after a quantifier makes it non-greedy. By default, a regex will match as much as it can (greedy).
Simply add a ? and it will be non-greedy and match as little as possible!
Good luck, hope that helps.
This works for getting the content from the beginning of a line up to the first word:
/^.*?([^\s]+)/gm
I faced a similar problem: capturing all the characters up to the first comma after the word entity_id. The solution that worked in BigQuery was this:
SELECT regexp_extract(line_items,r'entity_id*[^,]*')
I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
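A short sketch of the difference, written in Python for convenience (these constructs behave the same way in JavaScript). The sample inputs are made up:

```python
import re

# Character class: ONE character out of w, d, |, o, r, q.
char_class = re.compile(r"[wd|word|qw]=")
# Group with alternation: one of the whole words wd, word, qw.
alternation = re.compile(r"(?:wd|word|qw)=")

# The class is satisfied by the single 'o' before '=' in an unrelated parameter.
class_hit = char_class.search("foo=1")
# The alternation needs one of the full parameter names.
alt_miss = alternation.search("foo=1")
alt_hit = alternation.search("word=1")

print(class_hit.group(), alt_miss, alt_hit.group())
```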
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
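The effect of the unescaped dot can be seen with a short sketch (Python here, but the dot behaves the same in JavaScript; the second URL is an invented example):

```python
import re

# Unescaped dot: any character may sit between 'baidu' and 'com'.
loose = re.search(r"baidu.com", "hbaidu-com/")
# Escaped dot: only a literal '.' is accepted.
strict = re.search(r"baidu\.com", "hbaidu-com/")
# A real-looking host still matches the strict pattern.
strict_ok = re.search(r"baidu\.com", "www.baidu.com/s?wd=test")

print(loose.group(), strict, strict_ok.group())
```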
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
I'm fairly new at regex, and I've run into a problem that I cannot figure out:
I am trying to match a set of characters that starts with an arbitrary number of A-Z, 0-9, and _ characters, optionally followed by a number enclosed in a single set of parentheses, which may or may not be separated from the original string by a space.
Examples of what this should find:
_ABCD1E
_123FD(13)
ABDF1G (2)
This is my current regex expression:
[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}
It's finding everything just fine, but a problem exists if I have the following:
_ABCDE )
It should only grab _ABCDE and not the " )" but it currently grabs '_ABCDE )'
Is there some way I can grab the (#) but not get extra characters if that entire pattern does not exist?
If possible, please explain syntax as I am aiming to learn, not just get the answer.
ANSWER: The following code is working for what I needed so far:
[A-Z_0-9]+(\s*\([\d]+\)){0,1}
# or, as has been mentioned, the above can be simplified
# and cleaned up a bit to be
[A-Z_0-9]+(\s*\(\d+\))?
# The [] around \d are unnecessary and {0,1} is equivalent to ?
Adding the parentheses around the (#) pattern allows for the use of ? or {0,1} on the entire pattern. I also changed the [\d]* to be [\d]+ to ensure at least one number inside of the parentheses.
Thanks for the fast answers, all!
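For anyone who wants to compare the two versions, here is a sketch in Python run against the examples from the question. A non-capturing group is used so that group() handling stays simple; the variable names are illustrative:

```python
import re

# Original attempt: each parenthesis is optional on its own.
old = re.compile(r"[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}")
# Fixed version: the whole " (digits)" suffix is optional as one unit.
new = re.compile(r"[A-Z_0-9]+(?:\s*\(\d+\))?")

samples = ["_ABCD1E", "_123FD(13)", "ABDF1G (2)", "_ABCDE )"]
old_results = [old.search(s).group() for s in samples]
new_results = [new.search(s).group() for s in samples]

print(old_results)
print(new_results)
```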
Your regex says that each paren (open & closed) may or may not be there, INDEPENDENTLY. Instead, you should say that the number-enclosed-in-parens may or may not be there:
(\([\d]*\)){0,1}
Note that this allows for there to be nothing in the parens; that's what your regex said, but I'm not clear that's what you actually want.
How about
^[A-Z0-9_]+\s*(\([0-9]+\))?$
btw, from your example, the first part accepts not only [A-Z_], but also [0-9]
This seems to do the job.
[0-9A-Z_]+\s*(?:\([0-9]*\))?
It seems like you want the following regex:
^[A-Z\d_]+(\s*\(\d+\))?$
I used a non-capturing group to avoid grouping matching in results:
>>> pattern = r'[A-Z_]+\s*(?:\(\d+\)|\d*)'
>>> l = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )', 'A_B (15)', 'E (345']
>>> [re.search(pattern , i).group() for i in l]
['_ABCD1', '_123', 'ABDF1', '_ABCDE ', 'A_B (15)', 'E ']
Update:
This question was an epic failure, but here's the working solution. It's based on Gumbo's answer (Gumbo's was close to working so I chose it as the accepted answer):
Solution:
r'(?=[a-zA-Z0-9\-]{4,25}$)^[a-zA-Z0-9]+(\-[a-zA-Z0-9]+)*$'
Original Question (albeit, after 3 edits)
I'm using Python and I'm not trying to extract the value, but rather test to make sure it fits the pattern.
allowed values:
spam123-spam-eggs-eggs1
spam123-eggs123
spam
1234
eggs123
Not allowed values:
eggs1-
-spam123
spam--spam
I just can't have a dash at the starting or the end. There is a question on here that works in the opposite direction by getting the string value after the fact, but I simply need to test for the value so that I can disallow it. Also, it can be a maximum of 25 chars long, but a minimum of 4 chars long. Also, no 2 dashes can touch each other.
Here's what I've come up with after some experimentation with lookbehind, etc:
# Nothing here
Try this regular expression:
^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$
This regular expression only allows hyphens to separate sequences of one or more characters from [a-zA-Z0-9].
Edit Following up your comment: The expression (…)* allows the part inside the group to be repeated zero or more times. That means
a(bc)*
is the same as
a|abc|abcbc|abcbcbc|abcbcbcbc|…
Edit Now that you changed the requirements: as you probably don’t want to restrict the length of each hyphen-separated part, you will need a look-ahead assertion to take the overall length into account:
(?=[a-zA-Z0-9-]{4,25}$)^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$
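A sketch that runs this expression against the allowed and disallowed values from the question. The function name is_valid and the two extra length checks are assumptions added for illustration:

```python
import re

# Look-ahead enforces the 4-25 length; the rest enforces the dash placement.
PATTERN = re.compile(r"(?=[a-zA-Z0-9-]{4,25}$)^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$")

def is_valid(value):
    """Return True if value fits the dash and length rules."""
    return PATTERN.match(value) is not None

allowed = ["spam123-spam-eggs-eggs1", "spam123-eggs123", "spam", "1234", "eggs123"]
not_allowed = ["eggs1-", "-spam123", "spam--spam"]

for value in allowed + not_allowed:
    print(value, is_valid(value))
```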
The current regex is simple and fairly readable. Rather than making it long and complicated, have you considered applying the other constraints with normal Python string processing tools?
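import re
def fits_pattern(string):
    # Length and dash placement are checked with plain string methods.
    if (4 <= len(string) <= 25 and
            "--" not in string and
            not string.startswith("-") and
            not string.endswith("-")):
        # fullmatch makes sure every character is a letter, digit, or dash.
        return re.fullmatch(r"[a-zA-Z0-9-]+", string)
    else:
        return None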
It should be something like this:
^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$
You are telling it to look for only one character, either a-z, A-Z, 0-9 or -; that is what [] does.
So if you write [abc], you will match only "a", or "b", or "c", not "abc".
Have fun.
If you simply don't want a dash at the end and beginning, try ^[^-].*?[^-]$
Edit: Bah, you keep changing it.