Regex: Identify string occurrence in first X characters after keyword - python

Suppose the following string:
text = r"""Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge.
SOURCE Microsoft Corp."""
Goal:
I want to check if the company's name (Microsoft in the above example) occurs within the first X (250 for example) characters after the keyword "SOURCE".
Attempt:
source = re.compile(r"SOURCE.*")
re.findall(source,text)
#output ['SOURCE Microsoft Corp.']
In order to account for the character limitation in which the keyword should occur, I thought of using the .split() function on the output string and count the position at which the company's name occurs. This should work just fine if the company's name consists of one word only.
However, in cases where the company name includes multiple words (e.g., "Procter & Gamble") splitting the output string would result in ['SOURCE', 'Procter', '&', 'Gamble'] so that searching for the position of "Procter & Gamble" in this list wouldn't give back any results.
Is there a way to express, in the regex itself, the restriction that the company name has to occur within the first X characters after the keyword?

A performant alternative to a regex is str.find with its start and end parameters:
p1 = t.find('SOURCE')
start = p1 + len('SOURCE')
p2 = t.find('Microsoft', start, start + limit)
p2 will be >= 0 if 'Microsoft' lies entirely within the first limit characters after 'SOURCE', and -1 otherwise. This assumes 'SOURCE' is present in t; check that p1 is not -1 before relying on p2.
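The str.find approach can be wrapped in a small helper. This is only a sketch: the function name name_within_limit and its parameters are hypothetical, and it assumes the limit is counted from the end of the keyword and that the whole company name must fit inside the window.

```python
def name_within_limit(text, keyword, name, limit):
    """Return True if `name` lies entirely within `limit` characters after `keyword`."""
    keyword_pos = text.find(keyword)
    if keyword_pos == -1:
        return False  # keyword is absent, so there is nothing to search after
    start = keyword_pos + len(keyword)
    # str.find only reports matches that fit completely inside text[start:start + limit]
    return text.find(name, start, start + limit) != -1


sample = "Intro text.\nSOURCE Procter & Gamble Co."
print(name_within_limit(sample, "SOURCE", "Procter & Gamble", 250))  # True
print(name_within_limit(sample, "SOURCE", "Procter & Gamble", 10))   # False
```

Because the name is passed as a plain string, multi-word names such as "Procter & Gamble" need no special handling.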

You could allow a bounded number of characters between SOURCE and the company name. If the company name (Microsoft in this example) is 9 characters long and it must fall within the first 200 characters directly following SOURCE, there can be anywhere from 0 to 200-9=191 characters before it. So you would write:
re.findall('SOURCE.{0,191}Microsoft', text)
The .{a,b} expression matches any character (except a newline, unless re.DOTALL is set) between a and b times.
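The bound can also be computed from the name instead of being hard-coded. The sketch below is an illustration only: occurs_within is a hypothetical helper, re.escape is used so that names containing characters such as "&" or "." are matched literally, and re.DOTALL is an assumption added so the window may span line breaks.

```python
import re


def occurs_within(text, keyword, name, limit):
    # Characters allowed between the keyword and the name so that the
    # name still ends inside the window of `limit` characters.
    gap = limit - len(name)
    if gap < 0:
        return False  # the name cannot fit in the window at all
    pattern = re.escape(keyword) + ".{0,%d}" % gap + re.escape(name)
    return re.search(pattern, text, flags=re.DOTALL) is not None


sample = "SOURCE Procter & Gamble Co."
print(occurs_within(sample, "SOURCE", "Procter & Gamble", 200))  # True
print(occurs_within(sample, "SOURCE", "Procter & Gamble", 16))   # False
```

In the second call the window is exactly as long as the name, so the space after SOURCE pushes the name out of range.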

Another solution using re (regex101):
import re
text = """Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge.
SOURCE Microsoft Corp."""
pat = re.compile(r"(?<=SOURCE)(?=.{,250}Microsoft).*?Microsoft", flags=re.S)
if pat.search(text):
    print("Found")
Prints:
Found

You don't need a regex for this. Find the index of the last SOURCE with rfind, slice the 250 characters that follow it, then check whether the word is in the sliced string:
text = "Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. SOURCE Microsoft Corp."
start = text.rfind('SOURCE') + len('SOURCE')
print('Microsoft' in text[start:start + 250])
Output:
True

You can use the following regex to capture everything that comes X characters after 'SOURCE':
SOURCE.{250}(.*)
I only put 250 on the RegEx as an example. You can use any number you like.
For example, to match exactly 'Microsoft Corp.' you could do SOURCE.{1}(.*).
The parentheses define a regex capture group, which is basically an output variable. In Python you can retrieve the captured groups with re.findall():
>>> import re
>>> r = re.compile(r'SOURCE.{1}(.*)')
>>> r.findall('SOURCE Microsoft Corp.')
['Microsoft Corp.']
EDIT: Many of this post's answers rely on using 'Microsoft' for finding the company name, but that doesn't make much sense IMO

Related

Sanitize input to include whitespace in regex?

I'm using python 3.5 currently.
I am trying to make a tool that takes input, and does a regex search for said "Playername" and returns the matching result. I run into an interesting issue because this is videogame related, and some users have special characters in their names (Clan Tags).
To try to sanitize input, I am using re.escape, but I am not getting the behavior I expected out of it.
For example, I am allowing users to input partial matches and using a regex to find a player. So if I input Mall, it should be able to find Mallachar. Here is my current matching setup:
regex_match = r".*" + player_name + r".*"
if re.match(regex_match, str(name_list), re.IGNORECASE):
    player_list.append(players)
Because this is a system where user names are not unique, and a player can change their name, I am searching against a "list" of users.
Anyway, the issue I am running into is when people have spaces or clan tags. For example, if the clan ~DOG~ joins the server, and I have people with the names ~DOG~ Master and ~DOG~ Runner, and I feed in the string ~DOG~ Run, I get all matches for ~DOG~ .*.
My understanding is that re.escape should be escaping the space so it's a part of my search, so it should be trying to match this
.*~DOG~\sRun.*
But instead it seems to be running this, like it's ignoring everything after ~DOG~:
.*~DOG~.*
Am I misunderstanding how re.escape works?
You can use the in operator to check whether player_name occurs inside another string:
name_list = ['~DOG~ Master', '~DOG~ Runner']
player_name = '~DOG~ Run'
player_list = []
for name in name_list:
    if player_name in name:
        player_list.append(name)
print(player_list)
This prints:
['~DOG~ Runner']
Using in is probably the right way to solve this problem, but to address the regex question itself:
Adding a set of parentheses lets you retrieve the matched text with groups().
python so_post.py
('~DOG~ Run',)
alexl#MBP000413 ~ :)% cat so_post.py
import re
regex_match = r".*(~DOG~ Run).*"
name = "~DOG~ Run"
match = re.match(regex_match, name, re.IGNORECASE)
print(match.groups())
Using named groups lets you use a specific name instead of just a general tuple of matches.
regex_match = r".*(?P<user_clan>~DOG~ Run).*"
name = "~DOG~ Run"
match = re.match(regex_match, name, re.IGNORECASE)
print(match.group("user_clan"))
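For the partial-name lookup described in the question, a regex version could test each name individually rather than matching against str(name_list). The following is a minimal sketch under that assumption; find_players is a hypothetical helper and the sample names are taken from the question.

```python
import re


def find_players(partial_name, names):
    # re.escape makes characters with special regex meaning match literally.
    pattern = re.compile(re.escape(partial_name), re.IGNORECASE)
    # re.search scans the whole name, so no leading or trailing ".*" is needed.
    return [name for name in names if pattern.search(name)]


names = ["~DOG~ Master", "~DOG~ Runner", "Mallachar"]
print(find_players("~dog~ run", names))  # ['~DOG~ Runner']
print(find_players("Mall", names))       # ['Mallachar']
```

Testing one name at a time avoids the situation where a pattern wrapped in ".*" matches the string representation of the entire list.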

Python - Regular expression without order

I want to extract, for example, two entities from a sentence, e.g.:
str1 = 'i am tom and i have a car'
I want to extract the word 'tom' or 'jack' as name if it exists.
I also want to extract the word 'car' or 'bike' as property if it exists.
Now I can simply write two regular expressions:
re.search(r"(?P<name>tom|jack)", str1).group('name')
re.search(r"(?P<property>car|bike)", str1).group('property')
But I wonder if I can combine these two.
The problem is that I cannot know the order of name and property in advance. So the following code
re.search(r"(?P<name>tom|jack).*(?P<property>car|bike)", str1)
does not work for:
str2 = 'i have a car and i am tom'
I tried simply combining the two possible orders:
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property>car|bike).*(?P<name>tom|jack)))", str2)
but it gives me a "redefinition of group name" error unless I change it to
re.search(r"(((?P<name>tom|jack).*(?P<property>car|bike))|((?P<property2>car|bike).*(?P<name2>tom|jack)))", str2)
Question
How can I write a regular expression that extracts tom/jack as name and car/bike as property regardless of their order?
Moreover
I don't want to simply list all the possible orders, because there would be too many cases if I wanted to extract n kinds of entities.
Yes, it's possible, but only with lookarounds; otherwise the characters are consumed and the engine does not go back to look for the other entity.
\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))
Live demo
Every sub-pattern in a regex must match for the overall match to succeed. If some of them are not mandatory, make them optional.
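A short sketch of the lookahead pattern applied to both word orders from the question. The wrapper extract_entities is a hypothetical name; the pattern itself is the one given in the answer.

```python
import re

# Each lookahead scans the whole string from the start, so neither one
# depends on where the other entity appears.
ENTITY_PATTERN = re.compile(
    r"\A(?=.*(?P<name>tom|jack))(?=.*(?P<property>car|bike))"
)


def extract_entities(sentence):
    match = ENTITY_PATTERN.search(sentence)
    return match.groupdict() if match else None


print(extract_entities("i am tom and i have a car"))  # {'name': 'tom', 'property': 'car'}
print(extract_entities("i have a car and i am tom"))  # {'name': 'tom', 'property': 'car'}
print(extract_entities("i am tom"))                   # None
```

The last call returns None because both lookaheads are mandatory in this form.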

Extract part of string according to pattern using regular expression Python

I have files that follow a specific naming format, which looks something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # print time_date, e.g. 0800_20180102
print(match.group(2)) # print time, e.g. 0800
print(match.group(3)) # print date, e.g. 20180102
Note that for such tasks the group operator () inside the regexp is really helpful, it allows you to access certain substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
Groups are numbered from 1 to n, where n is the number of groups you defined, and group 0 is the whole matched pattern. The numbers are assigned from left to right, in the order the opening parentheses appear in your pattern.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]
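If the extracted pieces are needed as an actual timestamp, the split approach can be combined with datetime.strptime. This is a sketch under an assumption the question does not state explicitly: that the time part is HHMM and the date part is YYYYMMDD. The helper name parse_timestamp is hypothetical.

```python
from datetime import datetime


def parse_timestamp(filename):
    # The second and third underscore-separated fields hold time and date.
    parts = filename.split("_")
    time_part, date_part = parts[1], parts[2]
    # Assumed layout: HHMM for the time, YYYYMMDD for the date.
    return datetime.strptime(time_part + "_" + date_part, "%H%M_%Y%m%d")


print(parse_timestamp("test_0800_20180102_filepath.csv"))  # 2018-01-02 08:00:00
```

strptime raises ValueError when a field does not fit the assumed layout, which doubles as a basic validity check on the filename.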

Can I combine these two regexes into a single regex? (Find `that` in `string` if `this` is anywhere in `string`)

As input I have a series of long strings, which may or may not have the pattern(s) I'm looking for. The strings that have the pattern(s) will have an identifier(s) somewhere in the string, but not necessarily directly preceding the pattern(s). Currently I'm using this logic to find what I'm looking for:
droid_name = re.compile("(r2-d2|c-3po)")
location = re.compile("pattern_of_numbered_sectors_where_theyre_located")
find_droid = re.findall(location, string) if re.search(droid_name, string) else not_the_droids_youre_looking_for
r2-d2 and c-3po won't be the same length.
Can I combine this logic into a single regex? Thanks!
EDIT:
I'm looking for a one-line solution because I have a number of different types of information that I want to extract from various strings, so I'm using a dictionary with the regexes. So, something like this:
regexes = {
    'droid location': re.compile("droid_location_pattern"),
    'jedi name': re.compile("jedi_name_pattern"),
    'tatooine phone number': re.compile("tatooine_phone_pattern"),
}
def analyze(some_string):
    # iterate over (label, compiled regex) pairs
    for key, regex in regexes.items():
        data = re.findall(regex, some_string)
        if data:
            for data_item in data:
                send_to_mysql(label=key, info=data_item)
EDIT:
Some sample strings are below.
Valid numbers will have the pattern: 9XXXX, which may also be written as 9XXX-X
I don't want to match the number 92222:
[Darth Vader]: Hey babe, I'm chilling in the Death Star. Where are you?
[Padme Amidala]: At the Galactic Senate, can't talk.
[Darth Vader]: Netflix and chill?
[Padme Amidala]: Call me later on my burner phone, the number is: 92222.
Here, I want to match the number 97777, because the string contains r2-d2:
[communique yoda:palpatine] spotted luke skywalker i have.
[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.
[communique yoda:palpatine] location 97777 you must go.
Another possible match because the string contains c-3po:
root#palpatine$ at-at start --target c-3po --location 9777-7
AT-AT startup sequence...
[Error] fuel reserves low, aborting startup. Goodbye.
Don't want to match:
https://members.princessleiapics.com?username=stormtrooper&password=96969
Well, this highly depends on your actual strings. Assuming that c-3po or r2-d2 will always be before the desired location number (am I correct here?) you could use for both your examples the following regex:
(?:c-3po|r2-d2)(?=.*\b(9\d\d\d-?\d)\b)
# looks for c-3po or r2-d2 literally
# start a positive lookahead
# which consumes every character zero or unlimited times
# looks for a word boundary
# and captures a five digit number with or without a dash
# looks for a word boundary afterwards and close the lookahead
Be aware that this only works in DOTALL mode (aka the dot matches newline characters as well). See a working demo on regex101 here (copy and paste your other strings to confirm the examples are working).
Additional thoughts: It might be better, though, to check whether the strings c-3po or r2-d2 occur in the chunks using normal Python string functions and, if so, try to match the desired location number with the following regex:
\b(9\d\d\d-?\d)\b
# same as above without the lookahead
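The two-step idea (plain string check first, regex second) might look like the sketch below. The names DROID_NAMES, LOCATION and find_droid_locations are hypothetical, and the droid names are assumed to appear in lowercase as they do in the sample strings.

```python
import re

DROID_NAMES = ("r2-d2", "c-3po")
# A five-digit number starting with 9, optionally written as 9XXX-X.
LOCATION = re.compile(r"\b(9\d\d\d-?\d)\b")


def find_droid_locations(chunk):
    # Only look for location numbers when a droid identifier is present.
    if not any(name in chunk for name in DROID_NAMES):
        return []
    return LOCATION.findall(chunk)


yoda = ("[communique yoda:palpatine] with the droid he is. r2-d2 we must kill.\n"
        "[communique yoda:palpatine] location 97777 you must go.")
print(find_droid_locations(yoda))  # ['97777']
print(find_droid_locations("at-at start --target c-3po --location 9777-7"))  # ['9777-7']
print(find_droid_locations("Call me later, the number is: 92222."))  # []
```

Unlike the single-regex version, this does not depend on the droid name appearing before the location number.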

Find email domain in address with regular expressions

I know I'm an idiot, but I can't pull the domain out of this email address:
'blahblah#gmail.com'
My desired output:
'#gmail.com'
My current output:
.
(it's just a period character)
Here's my code:
import re
test_string = 'blahblah#gmail.com'
domain = re.search('#*?\.', test_string)
print domain.group()
Here's what I think my regular expression says ('#*?.', test_string):
' # begin to define the pattern I'm looking for (also tell python this is a string)
# # find all patterns beginning with the at symbol ("#")
* # find all characters after ampersand
? # find the last character before the period
\ # breakout (don't use the next character as a wild card, use it as a string character)
. # find the "." character
' # end definition of the pattern I'm looking for (also tell python this is a string)
, test string # run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
I'm basing this off the definitions here:
http://docs.activestate.com/komodo/4.4/regex-intro.html
Also, I searched but other answers were a bit too difficult for me to get my head around.
Help is much appreciated, as usual. Thanks.
My stuff if it matters:
Windows 7 Pro (64 bit)
Python 2.6 (64 bit)
PS. StackOverflow question: My posts don't include new lines unless I hit "return" twice in between them. For example (these are all on a different line when I'm posting):
# - find all patterns beginning with the at symbol ("#")
* - find all characters after ampersand
? - find the last character before the period
\ - breakout (don't use the next character as a wild card, use it as a string character)
. - find the "." character
, test string - run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
That's why I got a blank line b/w every line above. What am I doing wrong? Thx.
Here's something I think might help
import re
s = 'My name is Conrad, and blahblah#gmail.com is my email.'
domain = re.search("#[\w.]+", s)
print domain.group()
outputs
#gmail.com
How the regex works:
# - scan till you see this character
[\w.] a set of characters to potentially match, so \w is all alphanumeric characters, and the trailing period . adds to that set of characters.
+ one or more of the previous set.
Because this regex is matching the period character and every alphanumeric after an #, it'll match email domains even in the middle of sentences.
Ok, so why not use split? (or partition )
"#"+'blahblah#gmail.com'.split("#")[-1]
Or you can use other string methods like find
>>> s="bal#gmail.com"
>>> s[ s.find("#") : ]
'#gmail.com'
>>>
and if you are going to extract out email addresses from some other text
f = open("file")
for line in f:
    # check each whitespace-separated word on the line
    for word in line.split():
        if "#" in word:
            print("#" + word.split("#")[-1])
f.close()
Using regular expressions:
>>> re.search('#.*', test_string).group()
'#gmail.com'
A different way:
>>> '#' + test_string.split('#')[1]
'#gmail.com'
You can try using urllib
from urllib import parse
email = 'myemail#mydomain.com'
domain = parse.splituser(email)[1]
Output will be
'mydomain.com'
Just wanted to point out that chrisaycock's method would match invalid email addresses of the form
herp#
To ensure you only match a possibly valid email with a domain, you need to alter it slightly:
Using regular expressions:
>>> re.search('#.+', test_string).group()
'#gmail.com'
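The same check can be done without a regex using str.partition, which an earlier answer mentions. This is only a sketch: extract_domain is a hypothetical helper, and the separator is taken exactly as it is written in the sample strings above.

```python
def extract_domain(address, sep="#"):
    # partition returns (before, separator, after); the separator part is
    # empty when it does not occur in the string.
    local, found, domain = address.partition(sep)
    if not found or not local or not domain:
        return None  # rejects inputs such as 'herp#' or a string without a separator
    return sep + domain


print(extract_domain("blahblah#gmail.com"))  # #gmail.com
print(extract_domain("herp#"))               # None
```

Returning None for incomplete addresses mirrors the switch from '.*' to '.+' in the regex version.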
Using the regular expression below you can extract domains ending in .com or .in:
import re
s = 'my first email is user1#gmail.com second email is user2#yahoo.in and third email is user3#outlook.com'
print(re.findall(r'#\S+\.(?:in|com)', s))
output
['#gmail.com', '#yahoo.in', '#outlook.com']
Here is another method using the index function:
email_addr = 'blahblah#gmail.com'
# Find the location of # sign
index = email_addr.index("#")
# extract the domain portion starting from the index
email_domain = email_addr[index:]
print(email_domain)
#------------------
# Output:
#gmail.com
