Regex question to have substring match given strings - python

I want a single regex that combines
".*SimpleTaskv9MoreDetails.*"
or
".*SimpleTaskv10MoreDetails.*"
How can I create a regex that matches both of them? I know that the one below matches v8 and v9:
".*SimpleTaskv[89]MoreDetails.*"
But what if I want both v9 and v10 to be accepted? How do I do it?

Use alternatives:
.*SimpleTaskv(?:9|10)MoreDetails.*

If you want to generally match your existing pattern but with any non-zero-length sequence of digits in the middle, you could do this:
.*SimpleTaskv\d+MoreDetails.*
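
As a quick check, the two patterns above could be tried with Python's re module roughly as follows. This is only a sketch: the sample strings are made up for illustration and are not from the question.

```python
import re

# Alternation: accept only v9 or v10.
ALTERNATION = re.compile(r".*SimpleTaskv(?:9|10)MoreDetails.*")
# Any run of one or more digits after "v".
ANY_VERSION = re.compile(r".*SimpleTaskv\d+MoreDetails.*")

# Hypothetical sample strings, not taken from the question.
samples = [
    "xxSimpleTaskv9MoreDetailsyy",
    "xxSimpleTaskv10MoreDetailsyy",
    "xxSimpleTaskv8MoreDetailsyy",
]

# fullmatch requires the whole string to match the pattern.
alternation_hits = [s for s in samples if ALTERNATION.fullmatch(s)]
any_version_hits = [s for s in samples if ANY_VERSION.fullmatch(s)]

print(alternation_hits)
print(any_version_hits)
```

The alternation version rejects the v8 string, while the \d+ version accepts every numeric version, so the choice depends on how strict the match needs to be.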


Regex to find string with multiple dot (.) characters in between [duplicate]

I need a regex to validate a string like "foo.com", that is, a word which contains a dot. I have tried several patterns but could not get any of them to work.
The patterns I have tried:
(\\w+\\.)
(\\w+.)
(\\w.)
(\\W+\\.)
Can someone please help me with this?
Thanks,
Use a regex with a character class:
([\\w.]+)
If you want the string to contain just a single . then use
(\\w+\\.\\w+)
If you want to allow multiple . characters that are not adjacent, use
(\\w+(?:\\.\\w+)+)
To validate a string that contains exactly one dot with at least one word character on each side, match against
\w+\.\w+
which in Java is denoted as
\\w+\\.\\w+
This regex works:
[\w\[.\]\\]+
Tested with the following combinations:
foo.com
foo.co.in
foo...
..foo
As I understand your question, you need a regex to match a word that has a single dot inside it (not at the start or the end).
The regex below will satisfy that need.
^\\w+\\.\\w+$
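
To compare the single-dot and multiple-dot patterns from the answers above, here is a minimal Python sketch. The patterns are written as Python raw strings, so the backslashes are not doubled as they would be in a Java string literal. The helper names has_single_dot and has_separated_dots are made up for this example.

```python
import re

# Patterns from the answers above, written as Python raw strings.
SINGLE_DOT = re.compile(r"\w+\.\w+")
MULTI_DOT = re.compile(r"\w+(?:\.\w+)+")


def has_single_dot(text):
    """True if text is word characters, exactly one dot, then word characters."""
    return SINGLE_DOT.fullmatch(text) is not None


def has_separated_dots(text):
    """True if text is runs of word characters joined by single dots."""
    return MULTI_DOT.fullmatch(text) is not None


for candidate in ["foo.com", "foo.co.in", "foo...", "..foo"]:
    print(candidate, has_single_dot(candidate), has_separated_dots(candidate))
```

Using fullmatch plays the same role as the ^ and $ anchors in the last answer: the whole string has to fit the pattern, not just a part of it.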

Check if expression matches a regex

I would like to validate the following expressions:
"CODE1:123/CODE2:3467/CODE1:7686"
"CODE1:9090"
"CODE2:078/CODE1:7788/CODE1:333"
"CODE2:77"
In my case, the patterns 'CODE1:xx' and 'CODE2:xx' can appear in any order.
I could sort the patterns to get something like 'CODE1:XX/CODE1:YY/CODE2:ZZ'
and check whether it matches something like
r'[CODE1:\d+]*[CODE2:\d+]*'
Could we make it shorter? Is it possible to solve this with a single regex match?
Thanks
This regex will provide a match for all 4 cases:
CODE[12]:\d+(?:/CODE[12]:\d+)*
See here: https://regex101.com/r/wn30a5/1
It will match CODE followed by either 1 or 2, then a colon : and digits, optionally followed by a slash / and that same pattern again, any number of times. So a trailing slash is not permitted, a single code on its own is accepted, and the codes can appear in any order, so they don't need to be sorted first.
CODE is fixed and only the digit after it varies, so a shorter pattern for a single item is CODE\d:\d+ (note that \d accepts any digit, not just 1 or 2).
If you want to match exactly two digits after the colon, use CODE\d:\d{2}
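
A minimal sketch of how the first pattern above could be used for validation in Python, assuming the whole expression has to match. The helper name is_valid is made up for this example.

```python
import re

# One CODE1/CODE2 item, optionally followed by more items separated by "/".
CODE_LIST = re.compile(r"CODE[12]:\d+(?:/CODE[12]:\d+)*")


def is_valid(expression):
    """True if the whole expression is a slash-separated list of codes."""
    return CODE_LIST.fullmatch(expression) is not None


examples = [
    "CODE1:123/CODE2:3467/CODE1:7686",
    "CODE1:9090",
    "CODE2:078/CODE1:7788/CODE1:333",
    "CODE2:77",
]
results = [is_valid(e) for e in examples]
print(results)
```

Because fullmatch is used, inputs such as "CODE1:9090/" (trailing slash) or "CODE3:1" are rejected without any sorting step.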

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set, which matches a single character; () denotes a group, inside which | separates alternatives.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will also match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because an unescaped dot (.) matches any character except a newline. To match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to accomplish. Perhaps regex is not the best solution, or only part of the best solution?
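
The difference between the character class and the group can be sketched as follows. The example uses Python's re module for illustration; the constructs involved behave the same way in JavaScript regexes. The URLs are hypothetical and not from the question.

```python
import re

# Character class: [wd|word|qw] matches ONE character out of
# w, d, |, o, r, q. It does not match one of the three words.
CHAR_CLASS = re.compile(r"[wd|word|qw]=")

# Group with alternation: matches one of the whole words.
# The dot in baidu.com is escaped so it only matches a literal dot.
QUERY_KEY = re.compile(r".*baidu\.com.*[/?].*(?:wd|word|qw)=")

# Hypothetical URLs for illustration only.
urls = [
    "http://www.baidu.com/s?wd=regex",
    "http://www.baidu.com/s?word=regex",
    "http://www.baidu.com/s?qw=regex",
    "http://www.baidu.com/s?ie=utf-8",
]
matched = [u for u in urls if QUERY_KEY.match(u)]
print(matched)

# The character class is too loose: it accepts any key ending in one
# of its characters, such as "id=".
loose = CHAR_CLASS.search("http://www.baidu.com/s?id=5")
print(loose.group())
```

The group version accepts only the three intended parameter names, while the character-class version also accepts unrelated keys that merely end in one of the listed characters.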

RegEx for re-occurring phrase

I have the following phrase:
05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):\tSome_alphanum_text_Some_alphanum_text_Some_alphanum_text_Some_alphanum_text\t\t*************************************************************************************************\t05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):\tSome_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text\t\t
I would like to write a regex that extracts only 'Some_alphanum_text' and 'Some_other_alphanum_text'.
So far I was trying my luck with something like this:
r'(?:.+\(PID-\d{6}\):)(.+)'
But it only gives me the 'Some_other_alphanum_text' occurrence.
There can be more than 2 unique strings that I need to extract from this mess of a text. Any ideas?
You need to replace .+ with something that only matches what you want to return. Since you only want to match alphanumeric text, use \w instead of .
r'(?:\(PID-\d{6}\):)\s*(\w+)'
You need \s* before the second group because the whitespace before the alphanumeric text won't match \w+.
You also don't need .+ at the beginning. The match will just begin where it finds PID.
I believe you need this regex:
\(PID-\d{6}\):\\t(.+?)(?:\\t){2}
I think you could use the regex below to find all the instances of text occurring between "\t"s.
One thing you should consider is that there might be no '\t' at all. But
every matched text is either followed by a date such as 05/12/2016 02:03 or runs to the end of the input.
\(PID-\d{6}\)[\n\r\t\s]*:(?:.|[\n\r\t\s])*?(?=[0-9]{2}\/[0-9]{2}\/[0-9]{4}[\n\r\t\s]*[0-9]{2}:[0-9]{2}|$)
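
To illustrate why the original attempt returns only the last occurrence, and how the \w-based pattern from the first answer behaves, here is a small Python sketch. The input is a shortened, hypothetical version of the phrase from the question, and it assumes the fields are separated by real tab characters.

```python
import re

# Shortened, hypothetical version of the phrase from the question,
# with real tab characters between the fields.
text = (
    "05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):"
    "\tSome_alphanum_text\t\t*****\t"
    "05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):"
    "\tSome_other_alphanum_text\t\t"
)

# Original attempt: the leading .+ is greedy, so it runs up to the
# LAST "(PID-...):" and only one match is produced.
greedy = re.findall(r"(?:.+\(PID-\d{6}\):)(.+)", text)

# Pattern from the first answer: start at each PID marker, skip
# whitespace, and capture only word characters.
found = re.findall(r"\(PID-\d{6}\):\s*(\w+)", text)

print(greedy)
print(found)
```

re.findall returns the contents of the single capturing group for every non-overlapping match, so the second call yields one entry per PID marker.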

How to use \b word boundary in pandas str.contains?

Is there an equivalent when using str.contains?
The following code mistakenly lists "Said Business School" in the category because of 'Sa'. If I could add a word boundary, it would solve the problem. Putting a space after it messes this up. I am using pandas, and df is a DataFrame. I know I can use regex, but I am curious whether I can use plain strings to make it faster.
gprivate_n = ('Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation')
df.loc[df[df.Name.str.contains('{0}'.format(gprivate_n))].index, "Private"] = 1
This is just the same old Python issue with regexes: '\b' has to be passed either in a raw string (r'\b...') or, less desirably, double-escaped ('\\b').
So your regex should be:
gprivate_n = (r'\b(Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation)')
A word boundary is not a character, so you can't find it with .contains. You need to either use regex or split the strings into words and then check for membership of each of those words in the set you currently have defined in gprivate_n.
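
A minimal sketch of the word-boundary approach, assuming a boundary is wanted on both sides of each keyword. It uses only the standard re module with a shortened, hypothetical keyword list and made-up names. The same pattern string could be passed to df.Name.str.contains(pattern), which treats its argument as a regex by default.

```python
import re

# Hypothetical, shortened keyword list; "Sa" needs no trailing space
# once word boundaries are used.
keywords = ["Co", "Inc", "Ltd", "Sa", "Group"]

# \b on both sides so a keyword only matches as a whole word.
# re.escape guards against keywords that contain regex metacharacters.
# The raw strings keep \b from being read as a backspace character.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

# Made-up names for illustration only.
names = ["Said Business School", "Acme Sa", "Widget Co", "Incline Partners"]
flags = [re.search(pattern, name) is not None for name in names]
print(flags)
```

The non-capturing group (?:...) is a deliberate choice here: with a capturing group, pandas may warn that the pattern contains match groups when it is used with str.contains.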
