How to use \b word boundary in pandas str.contains? - python

Is there an equivalent of the \b word boundary when using str.contains?
The following code mistakenly puts "Said Business School" in the category because of 'Sa'. If I could add a word boundary, it would solve the problem; putting a space after 'Sa' messes this up. The dfs are pandas DataFrames. I know I can use regex, but I'm curious whether plain strings would work and be faster.
gprivate_n = ('Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation')
df.loc[df[df.Name.str.contains('{0}'.format(gprivate_n))].index, "Private"] = 1

This is the usual Python regex issue: '\b' must be passed in a raw string (r'\b...') or, less desirably, double-escaped ('\\b'). In an ordinary string literal, '\b' is a backspace character, not a word boundary.
So your regex should be:
gprivate_n = (r'\b(Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation)')
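As a rough sketch of the same idea using only the standard re module: the version below assumes whole-word matching is wanted, so it adds a trailing \b as well, drops the space after 'Sa', and keeps '&' outside the boundaries because \b only applies next to a word character. The sample names other than "Said Business School" are made up for illustration.

```python
import re

# Hypothetical sample names; only "Said Business School" comes from the question.
names = [
    "Said Business School",
    "Acme Sa",
    "Foo Inc",
    "Johnson & Johnson",
    "Costco Wholesale",
]

words = "Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa|Insurance|Ag|As|Media|Corporation"

# \b on both sides so "Sa" does not match inside "Said" and "Co" not inside "Costco".
# "&" is handled separately: \b needs a word character on one side of the boundary.
pattern = re.compile(rf"\b(?:{words})\b|&")

flags = [bool(pattern.search(name)) for name in names]
print(flags)
```

The same pattern string should also work with df.Name.str.contains, which treats its argument as a regex by default; the non-capturing group (?:...) is used because pandas warns when the pattern contains capture groups.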

A word boundary is not a character, so a plain substring search can't find it. You need to either use a regex (which str.contains accepts by default) or split the strings into words and then check whether any of those words is in the set you currently have defined in gprivate_n.
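The split-and-check alternative could look roughly like this. It is a sketch in plain Python; the helper name is_private and the set name private_words are made up here, and splitting on whitespace is an assumption about what counts as a word.

```python
# The alternatives from gprivate_n as a set (without the trailing space on "Sa").
private_words = {
    "Co", "Inc", "Llc", "Group", "Ltd", "Corp", "Plc", "Sa",
    "Insurance", "Ag", "As", "Media", "&", "Corporation",
}

def is_private(name):
    """Return True if any whitespace-separated word of name is in private_words."""
    return not private_words.isdisjoint(name.split())

print(is_private("Said Business School"))  # "Said" is not "Sa"
print(is_private("Johnson & Johnson"))
```

One limitation of this approach: punctuation stays attached to the word, so "Inc." would not match "Inc" unless the words are cleaned first.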

Related

Use regex to remove a substring that matches a beginning of a substring through the following comma

I haven't found any helpful Regex tools to help me figure this complicated pattern out.
I have the following string:
Myfirstname Mylastname, Department of Mydepartment, Mytitle, The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386
I am trying to learn enough Regex patterns to remove the substrings one at a time:
E-mail:my.email#example.jp
Tel:00-00-222-1171
Fax:00-00-225-3386
So I think the correct pattern would be to remove a given word (i.e., "E-mail", "Tel") all the way through the following comma.
Is this type of dynamic pattern possible in regex?
I am performing the match in Python, however, I don't think that would matter too much.
Also, I know the data string looks comma separated, and it is. However there is no guarantee of preserving the order of those fields. That's why I'm trying to use a Regex match.
How about this regex:
<YOUR_WORD>.*?(?=(,|($)))
Explanation:
It looks for the word specified in the <YOUR_WORD> placeholder.
It then lazily matches any characters after it (as few as possible).
The match stops just before one of two things:
a comma (,), or
the end of the line.
So:
E-mail.*?(?=(,|($)))
Will result in:
E-mail:my.email#example.jp
And
Fax.*?(?=(,|($)))
Will result in:
Fax:00-00-225-3386
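In Python this could be wrapped up as follows. This is only a sketch: the helper names extract_field and remove_field are made up, and the nested groups in the lookahead are simplified to (?=,|$), which matches the same positions.

```python
import re

s = ("Myfirstname Mylastname, Department of Mydepartment, Mytitle, "
     "The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, "
     "Japan E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386")

def field_pattern(word):
    # re.escape keeps characters such as "-" or "." in the word literal.
    return re.compile(re.escape(word) + r".*?(?=,|$)")

def extract_field(text, word):
    """Return the substring from word up to (not including) the next comma."""
    match = field_pattern(word).search(text)
    return match.group() if match else None

def remove_field(text, word):
    """Return text with that substring removed; the comma itself is kept."""
    return field_pattern(word).sub("", text)

for word in ("E-mail", "Tel", "Fax"):
    print(extract_field(s, word))
```

Because the lookahead does not consume the comma, remove_field leaves the separators in place; stripping them as well would need an extra step.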
If there are edge cases it misses, I would like to know, along with whether handling them affects performance or is even necessary.

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that, because of the parentheses in (941), the regex treats 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but first I need help finding the 2nd (941) so that I can then apply the 3rd step to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try escaping the variable with re.escape() and interpolating it into the regular expression with an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
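import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
# Note: \s in the character class also matches \n, so a single match can span several lines
print(re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s))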
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines. To match a literal dot, you have to escape it with a backslash (like \.).
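The difference between the character class and the group can be checked quickly in Python. The two test URLs below are invented for illustration, and the fixed pattern also escapes the dot, which goes slightly beyond the minimal change suggested above.

```python
import re

# Original attempt: [wd|word|qw] is a character class, i.e. ONE character
# out of w, d, |, o, r, q, so any parameter ending in one of those matches.
char_class = re.compile(r".*baidu.com.*[/?].*[wd|word|qw]{1}=")

# Grouped alternation with an escaped literal dot.
grouped = re.compile(r".*baidu\.com.*[/?].*(?:wd|word|qw)=")

# Hypothetical URLs, for illustration only.
wanted = "http://www.baidu.com/s?word=test"
unwanted = "http://www.baidu.com/s?period=1"

print(bool(char_class.match(unwanted)))  # the class accepts the "d" before "="
print(bool(grouped.match(unwanted)))
print(bool(grouped.match(wanted)))
```

Note that even the grouped version has no boundary before the parameter name, so something like sword= would still match.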
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

Regex match only if word count between 1-50

So I have this code:
(r'\[quote\](.+?)\[/quote\]')
What I want to do is to change the regex so it only matches if the text within [quote] [/quote] is between 1-50 words.
Is there any easy way to do this?
Edit: Removed confusing html code in the regex example. I am NOT trying to match HTML.
Sure there is, depending on how you define a "word."
I would do so separately from regex, but if you want to use regex, you could probably do:
r"\[quote\](.+?\s){1,49}\[/quote\]"
That will match between 2 and 50 words (since it demands a trailing \s, it can't match ONE)
Crud, that also won't match the LAST word, so let's do this instead:
r"\[quote\](.+?(?:\s.+?){1,49})\[/quote\]"
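Doing the word count separately from the regex, as suggested at the start of this answer, might look like the sketch below. It assumes a "word" is any whitespace-separated token; the function name quotes_within_limit and its parameters are made up. Counting outside the regex also sidesteps the fact that . matches whitespace too, which makes a regex-only limit easy to get subtly wrong.

```python
import re

# The pattern from the question, unchanged.
QUOTE_RE = re.compile(r"\[quote\](.+?)\[/quote\]")

def quotes_within_limit(text, min_words=1, max_words=50):
    """Return the quote bodies whose word count is within the given limits."""
    return [
        body
        for body in QUOTE_RE.findall(text)
        if min_words <= len(body.split()) <= max_words
    ]

long_body = " ".join(["word"] * 60)
sample = f"[quote]short one[/quote] text [quote]{long_body}[/quote]"
print(quotes_within_limit(sample))
```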
This is a definite misuse of regexes for a lot of reasons, not the least of which is the problem matching [X]HTML as @Hyperboreus noted, but if you really insist you could do something along the lines of ([a-zA-Z0-9]\s){1}{49}.
For the record, I don't recommend this.

string consists of punctuation

I want to check whether a string contains a continuous sequence of exclamation marks, question marks, or both.
By continuous, I mean more than 2 times, like below:
import re
# word, sent and i come from the surrounding loop (not shown)
# If sentence contains !!!
exc = re.compile(r"(.)\!{2}")
word["cont_exclamation"] = bool(exc.search(sent[i]))
# If sentence contains ???
reg = re.compile(r"(.)\?{2}")
word["cont_question"] = bool(reg.search(sent[i]))
But now I want to find both exclamation and question marks, for example hello??! or hey!! or dude!?!
Also, what if I want both ? and !, with more than 2 of either?
I don't know regex properly, so any help would be great.
Use the regex '[?!]{3,}', which matches the ? or ! characters 3 or more times in a row (if continuous means more than two times). Escaping is not needed inside a character class.
Add more punctuation characters to the char class as needed
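A small sketch of that pattern in use, with the examples from the question; the helper name has_punct_run is made up for illustration.

```python
import re

# Three or more consecutive characters, each either ? or !
punct_run = re.compile(r"[?!]{3,}")

def has_punct_run(sentence):
    """Return True if sentence contains a run of 3+ question/exclamation marks."""
    return punct_run.search(sentence) is not None

for sentence in ("hello??!", "hey!!", "dude!?!"):
    print(sentence, has_punct_run(sentence))
```

With {3,}, "hey!!" is not flagged because it has only two marks; if two in a row should count, {2,} would be the quantifier to use instead.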
try re.compile(r"(.)[\?\!]{2}")
regex = re.compile(r"(.)(\?|\!){2}")
Edit: Typing "regex tutorial" into Google gives more info than you could possibly need. This tutorial looks particularly well-balanced between conciseness and completeness.
Particularly (i.m.o.) useful tricks that are often not mentioned:
Use +? and *? to switch from a greedy to a lazy match, i.e. match as few characters as possible instead of as many as possible. Example text: #ab# #de# --> #.*?# matches #ab# only (not #ab# #de#)
parentheses create a capture group by default. If you don't want this, you can use (?:...).
Most importantly, comment each regexp with a human-readable explanation. Future-you will be grateful. :-)
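These three tips can be tried out directly in Python. The snippet is only a sketch; re.VERBOSE is one way (not the only one) to keep a human-readable comment next to the pattern.

```python
import re

text = "#ab# #de#"

# Greedy vs. lazy: .* takes as many characters as possible, .*? as few.
greedy = re.search(r"#.*#", text).group()
lazy = re.search(r"#.*?#", text).group()

# With a capture group, findall returns the group contents;
# with a non-capturing group (?:...), it returns the whole match.
captured = re.findall(r"#(ab|de)#", text)
whole = re.findall(r"#(?:ab|de)#", text)

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
commented = re.compile(r"""
    [?!]{3,}   # three or more consecutive ? or ! characters
""", re.VERBOSE)

print(greedy, lazy, captured, whole, bool(commented.search("dude!?!")))
```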
