Python Regular Expression findall with variable - python

I am trying to use re.findall with look-behind and look-forward to extract data. The regular expression works fine when I am not using a raw_input variable, but I need users to be able to input a variety of different search terms.
Here is the current code:
me = re.findall(r"(?<='(.+)'+variable+'(.+)')(.*?)(?='(.+)+variable+(.+)')", raw)
As you can see, I am trying to pull out strings between one search term.
However, each time I use this type of formatting, I get a fixed width error. Is there anyway around this?
I have also tried the following formats with no success.
variable = raw_input('Term? ')
'.*' + variable + '.*'
and
'.*%s.*' % (variable, )
and
'.*{0}.*'.format(variable)
and
'.*{variable}.*'.format(variable=variable)

I'm not sure if this is what you mean, but it may get you started. As far as I understand your question, you don't need lookaheads or lookbehinds. This is for Python 2.x (won't work with Python 3):
>>> import re
>>> string_to_search = 'fish, hook, swallowed, reeled, boat, fish'
>>> entered_by_user = 'fish'
>>> search_regex = r"{0}(.+){0}".format(entered_by_user)
>>> match = re.search(search_regex, string_to_search)
>>> if match:
... print "result:", match.group(1).strip(' ,')
...
result: hook, swallowed, reeled, boat
If you really want the last 'fish' in the result as in your comment above, then just remove the second {0} from the format() string.

This solution should work:
me = re.findall(rf"(?<='(.+)'+{variable}+'(.+)')(.*?)(?='(.+)+{variable}+(.+)')", raw)
You also can add many different variables as you wish.
Add rf for the regular expression and the desired variables in between {}
import re
text = "regex is the best"
var1 = "is the"
var2 = "best"
yes = re.findall(rf"regex {var1} {var2}", text)
print(yes)
['regex is the best']

The way lookbehind is usually implemented (including its Python implementation) has an inherent limitation that you are unfortunately running into: lookbehinds cannot be variable-length. The "Important Notes About Lookbehind" section here explains why. I think you should be able to do the regex without a lookbehind, though.

Related

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
print('yay')
else:
print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
whatever
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail for the capture group 1 between the brackets (). Great, almost there! (2) We then match any character .*? with greedy search enforced so that we capture the whole string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
keyphrase_content = keyphrase_match.group(1)
print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
# Bonus tip: Use double quotes to make a string containing apostrophe
# without using a backslash escape
print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. If you find another '#', stop capturing characters. If that '#' is followed by another one, stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them.
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(input):
return keyphrase_regex.find_all(input)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(input, keyphrase_regex=keyphrase_double_at_sign):
return keyphrase_regex.find_all(input)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match('##\W+##', j)
which will look for 2 leading ##s, then any number greater than 1 alphanumeric characters (\W+), then 2 trailing ##s. From there, your strip code looks fine and you should be able to grab it.

How can i ignore the characters between brackets?

Example:
The hardcoded input in the system:
Welcome to work {sarah} have a great {monday}!
The one i get from an api call might differ by the day of the week or the name example:
Welcome to work Roy have a great Tuesday!
I want to compare these 2 lines and give an error if anything but the terms in brackets doesn't match.
The way I started is by using assert which is the exact function I need then tested with ignoring a sentence if it starts with { by using .startswith() but I haven't been successful working my way in specifics between the brackets that I don't want them checked.
Regular expressions are good for matching text.
Convert your template into a regular expression, using a regular expression to match the {} tags:
>>> import re
>>> template = 'Welcome to work {sarah} have a great {monday}!'
>>> pattern = re.sub('{[^}]*}', '(.*)', template)
>>> pattern
'Welcome to work (.*) have a great (.*)!'
To make sure the matching halts at the end of the pattern, put a $:
>>> pattern += '$'
Then match your string against the pattern:
>>> match = re.match(pattern, 'Welcome to work Roy have a great Tuesday!')
>>> match.groups()
('Roy', 'Tuesday')
If you try matching a non-matching string you get nothing:
>>> match = re.match(pattern, 'I wandered lonely as a cloud')
>>> match is None
True
If the start of the string matches but the end doesn't, the $ makes sure it doesn't match. The $ says "end here":
>>> match = re.match(pattern, 'Welcome to work Roy have a great one! <ignored>')
>>> match is None
True
edit: also you might want to escape your input in case anyone's playing silly beggars.
You can make copies that do not include anything that has brackets around it and compare those. It is relatively easy with regular expressions. As a function, it could look like this:
import re
# compare two strings, ignoring everything that has curly brackets around it
def compare_without_brackets(s_1, s_2, p=re.compile(r"{.*?}")):
return p.sub('', s_1) == p.sub('', s_2)
# example
first = 'Welcome to work {sarah} have a great {monday}!'
second = 'Welcome to work {michael} have a great {tuesday}!'
print(compare_without_brackets(first, second))
>> True
edit: reworked my answer after seeing I got something wrong. It works now in a way that everything with curly brackets around it is replaced with a universal match. Now you can compare the hardcoded version with any returned from the API and still get either True or False, depending on whether they match or not.
import re
# compare a hardcoded string with curly braces with one returned from the API
def compare_without_brackets(hardcoded, from_API, p=re.compile(r"{.*?}")):
pattern = re.compile(p.sub(r'(.*)', hardcoded))
return pattern.match(from_API) is not None
# example
first = 'Welcome to work {sarah} have a great {monday}!'
second = 'Welcome to work michael have a great tuesday!'
print(compare_without_brackets(first, second))
>>>> True

Python: check if string meets specific format

Programming in Python3.
I am having difficulty in controlling whether a string meets a specific format.
So, I know that Python does not have a .contain() method like Java but that we can use regex.
My code hence will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format.
So the code will probably be like this:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for if there is an easier way than using the re module, I would say that there isn't really to achieve the result you're looking for. While using the in keyword in python works in the way you would expect a contains method to work for a string, you are actually wanting to know if a string is in a correct format. As such the best solution, as it is relatively simple, is to use a regular expression, and thus use the re module.
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
try:
int(lowpan_headers[4:], 16) # tries interpreting the last 28 characters as hexadecimal
print('Input is valid!')
except ValueError:
print('Invalid Input') # hex test failed!
else:
print('Invalid Input') # either length test or 'bbbb' prefix test failed!
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex as you're indeed trying to evaluate a certain String format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" are the four 'b' at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
As I mentioned in the comment Python does have contains() equivalent.
if "blah" not in somestring:
continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation

Python regex to extract substring at start and end of string

I am looking for a regex that will extract everything up to the first . (period) in a string, and everything including and after the last . (period)
For example:
my_file.10.4.5.6.csv
myfile2.56.3.9.txt
Ideally the regex when run against these strings would return:
my_file.csv
myfile2.txt
The numeric stamp in the file will be different each time the script is run, so I am looking essentially to exclude it.
The following prints out the string up to the first . (period)
print re.search("^[^.]*", data_file).group(0)
I am having trouble though getting it to also return the the last period and string after it.
Sorry just to update this based upon feedback and comments below:
This does need to be a regex. The regex will be passed into the program from a configuration file. The user will not have access to the source code as it will be packaged.
The user may need to change the regex based upon some arbitrary criteria, so they will need to update the config file, rather than edit the application and re-build the package.
Thanks
You don’t need a regular expression!
parts = data_file.split(".")
print parts[0] + "." + parts[-1]
Instead of regular expressions, I would suggest using str.split. For example:
>>> data_file = 'my_file.10.4.5.6.csv'
>>> parts = data_file.split('.')
>>> print parts[0] + '.' + parts[-1]
my_file.csv
However if you insist on regular expressions, here is one approach:
>>> print re.sub(r'\..*\.', '.', data_file)
my_file.csv
You don't need a regex.
tokens = expanded_name.split('.')
compressed_name = '.'.join((tokens[0], tokens[-1]))
If you are concerned about performance, you could use a length limit and rsplit() to only chop up the string as much as you need.
compressed_name = expanded_name.split('.', 1)[0] + '.' + expanded_name.rsplit('.', 1)[1]
Do you need a regex here?
>>> address = "my_file.10.4.5.6.csv"
>>> split_by_periods = address.split(".")
>>> "{}.{}".format(address[0], address[-1])
>>> "my_file.csv"

python regex on variable

Please help with my regex problem
Here is my string
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
The source_resource is in the source may end with & or with .[for example].
So far,
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have used the text here. Rather using text, how can i use source_resource variable with & or . to find this out.
If the goal is to extract the pf_rd_m value (which it apparently is as you are using regex.findall), than I'm not sure regex are the easiest solution here:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
You also have to escape the .
pattern=re.compile(source_resource + '[&\.]')
You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource="pf_rd_m=ATVPDKIKX0DER"
regex_string = source_resource + "[&\.]"
regex = re.compile(regex_string)
print regex.findall(source_and)
print regex.findall(source_dot)
>>> ['pf_rd_m=ATVPDKIKX0DER&']
['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
Just take note that I modified your regular expression: the . is a special symbol and needs to be escaped, as is the + (I just assumed the string will only occur once, which makes the use of + unnecessary).

Categories