Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re

keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####

j = input('> ')  # e.g. ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')

k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the regex \W in the middle of keyphrase, to no avail, as well as weird combinations of \$\$ and quotes.
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) The literal ## head and tail surround capture group 1, delimited by the parentheses (). Great, almost there! (2) Inside the group, .*? matches zero or more of any character using the lazy (non-greedy) quantifier, so the capture stops at the first closing ## instead of running on to the last one.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re

keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if the regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None

if keyphrase_content:
    print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing an apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")

# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners' guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time.
The regex ##(.*)## basically means: search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character into group 1. Because .* is greedy, keep capturing as far as possible and only back up enough that the capture is still followed by '##'; that means if the input contains several '##...##' pairs, everything up to the last '##' ends up in the capture (the lazy version in the edit below fixes that). On success, hold onto the entire match (from the first '##' to the closing '##'), and also hold onto the captured characters in case the programmer asks for just them.
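To see the difference concretely, here is a small sketch (the sample sentence is made up) comparing the greedy and lazy versions when the input contains more than one ##...## pair:

import re

text = 'a ##comment## does something like ##this##'

# Greedy: .* grabs as much as it can, so the capture runs to the last '##'
print(re.search(r'##(.*)##', text).group(1))   # comment## does something like ##this

# Lazy: .*? stops at the first closing '##'
print(re.search(r'##(.*?)##', text).group(1))  # comment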
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: several characters have special regex meanings, like '*' and '?', so if you want to match those characters literally, you need to escape them (e.g. '\?').
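If you do attempt that exercise, a minimal sketch might look like the following (make_keyphrase_regex is a made-up name; re.escape does the escaping for you):

import re

def make_keyphrase_regex(tag):
    # re.escape neutralizes metacharacters like '*' and '?' in the tag
    escaped = re.escape(tag)
    return re.compile(escaped + '(.*?)' + escaped)

print(make_keyphrase_regex('**').findall('a **bold** and **brash** test'))
# ['bold', 'brash']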
If you want to grab the content between the "#", then try this:
j = input("> ")
print("".join(j.split("#")))  # prints the input with every '#' removed
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that check will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+), then 2 trailing #s. From there, your strip code looks fine and you should be able to grab it.
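Putting that together with your original test (a sketch; adding a capture group lets you grab the inner word directly instead of stripping '#' afterwards):

import re

j = input('> ')  # e.g. ##whatever##
m = re.match(r'##(\w+)##', j)
if m:
    print('yay')
    print(m.group(1))  # whatever
else:
    print("you still haven't figured it out...")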
I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all of the [, (, ", \, and \n characters from this string at once. I can do it one by one, but I always fail with '\n'. Is there an efficient way to remove or translate all of these characters or blank-line symbols?
Since my sentences are not long, I do not want to use dictionary methods like in earlier questions.
Maybe you could use regex to find all the characters that you want to replace
import re

s = s.strip()
# remove [ ] ( ) " ' and , in one pass
r = re.compile(r"[\[\]()\"',]")
s = r.sub('', s)
# the literal \n sequences are easiest to drop with a plain replace afterwards
print(s.replace("\\n", ""))
I had some problems matching the "\n" in the regex, but it is easy to remove with a plain replace afterwards.
If the string is a valid Python expression, then you can use literal_eval from the ast module to transform the string into tuples, and after that you can process every tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
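To illustrate the literal_eval route on a tiny made-up sample:

from ast import literal_eval

sample = '[("Great instructor\\n",), ("Likes art\\n",)]'
print(' '.join(el[0].strip() for el in literal_eval(sample)))
# Great instructor Likes art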
If not then you can use this:
import re

def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # drop literal \n sequences, quotes, and stray backslashes
        yield re.sub(r'\\n|[\'"\\]', '', part).strip(', ')

''.join(get_part_string(your_string))
I was wondering if any of the following exist in python:
A: non-regex equivalent of "re.findall()".
B: a way of neutralizing regex special characters in a variable before passing to findall().
I am passing a variable to re.findall, which runs into problems when the variable has a period, a slash, a caret, etc., because I would like these characters to be interpreted literally. I realize it is not necessary to use regex to do this job, but I like the behavior of re.findall() because it returns a list of every match it finds. This allows me to easily count how many times the substring occurs by using len().
Here's an example of my code:
>>substring_matches = re.findall(randomVariableOfCharacters, document_to_be_searched)
>>
>>#^^ will return something like ['you', 'you', 'you']
>>#but could also return something like ['end.', 'end.', 'ends']
>>#if my variable is 'end.' because "." is a wildcard.
>>#I would rather it return ['end.', 'end.']
>>
>>occurrences_of_substring = len(substring_matches)
I'm hoping to not have to use string.find(), if possible. Any help and/or advice is greatly appreciated!
You can use str.count() if you only want the number of occurrences, but it's not equivalent to re.findall(); it only gets the count.
document_to_be_searched = "blabla bla bla."
numOfOcur = document_to_be_searched.count("bl")  # 4
Sure: looking at your code, I think what you're looking for is str.count.
>>> 'abcdabc'.count('abc')
2
Note, however, that this is not an equivalent of re.findall, although it looks more appropriate in your case.
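As for part B of the question (neutralizing regex special characters in a variable before passing it to findall), re.escape does exactly that; a minimal sketch:

import re

document_to_be_searched = "The end. The end. It ends."
substring = 'end.'

# re.escape escapes the '.' so it is matched literally rather than as a wildcard
matches = re.findall(re.escape(substring), document_to_be_searched)
print(matches)       # ['end.', 'end.']
print(len(matches))  # 2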
I am trying to capture sub-strings from a string that looks similar to
'some string, another string, '
I want the result match group to be
('some string', 'another string')
my current solution
>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')
works, but is not practical: what I am showing here is of course massively reduced in complexity compared to what I'm doing in the real project; I want to use one 'straight' (non-computed) regex pattern only. Unfortunately, my attempts have failed so far:
This doesn't match (None as result), because {2} is applied to the space only, not to the whole string:
>>> match('.*?, {2}', 'some string, another string, ')
adding parentheses around the repeated pattern leaves the comma and space in the result:
>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)
adding another set of parentheses does fix that, but gets me too much:
>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')
adding a non-capturing modifier improves the result, but still misses the first string
>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
I feel like I'm close, but I can't really seem to find the proper way.
Can anyone help me ? Any other approaches I'm not seeing ?
Update after the first few responses:
First up, thank you very much everyone, your help is greatly appreciated! :-)
As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large amounts of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There is also XML, JSON, binary and some other data file formats, but let's stay focussed.
In order to cope with the multitude of file formats and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after the other, applies a regex to every line, and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ implementation for performance reasons, which will be connected over Boost::Python and will probably add regex dialects to the list of complexities.
Also, there are not 2 repetitions, but a number that currently varies between zero and 70 (or so), the comma is not always a comma, and despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let's just say I have reason to try to reduce the 'dynamic' part and keep as much 'fixed' pattern as possible.
So, in a word: I must use regular expressions.
Attempt to rephrase: I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture
'some string, another string, '
into
('some string', 'another string')
?
Hmmm, that probably narrows it down too far - but then, any way you do it is wrong :-D
Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there have got to be 2 of something) but return only 1 string (the second one)?
The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:
>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)
Also, it's not the second string that's returned, it is the last one:
>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)
Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...
Unless there's much more to this problem than you've explained, I don't see the point in using regexes. This is very simple to deal with using basic string methods:
[s.strip() for s in mys.split(',') if s.strip()]
Or if it has to be a tuple:
tuple(s.strip() for s in mys.split(',') if s.strip())
The code is more readable too. Please tell me if this fails to apply.
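A quick check against the sample input from the question (mys standing in for your real string):

mys = 'some string, another string, '
print([s.strip() for s in mys.split(',') if s.strip()])
# ['some string', 'another string']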
EDIT: Ok, there is indeed more to this problem than it initially seemed. Leaving this for historical purposes though. (Guess I'm not 'disciplined' :) )
As described, I think this regex works fine:
import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match
thepattern.findall("a, b, asdf, d") # until comma or end of line
# Result:
Out[19]: ['a', ' b', ' asdf', ' d']
The key here is to use findall rather than match. The phrasing of your question suggests you prefer match, but it isn't the right tool for the job here -- it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your 'number of strings' is variable, the right approach is to use either findall or split.
If this isn't what you need, then please make the question more specific.
Edit: And if you must use tuples rather than lists:
tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')
import re

regex = " *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"

print(re.match(regex, 'some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string , another string, ').groups())
# ('some string', 'another string')
No offense, but you obviously have a lot to learn about regexes, and what you're going to learn, ultimately, is that regexes can't handle this job. I'm sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.
Do yourself a favor: forget about regexes and learn pyparsing instead. Or skip Python entirely and use a standalone parser generator like ANTLR. In either case, you'll probably find that grammars for most of your file formats have already been written.
I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, '?
I don't think there is such a notation.
But regexes are not only a matter of NOTATION, that is to say the RE string used to define a regex. They are also a matter of TOOLS, that is to say functions.
Unfortunately, I can't use findall as the string from the initial question is only a part of the problem; the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches.
You should give more information up front: we could understand the constraints more quickly. In my opinion, to answer your problem as it has been stated, findall() is indeed OK:
import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
    print(re.findall('(.+?), *', line))
Result
['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']
Now, since you "have omitted a lot of complexity" in your question, findall() could turn out to be insufficient to handle that complexity. In that case finditer() can be used, because it allows more flexibility in selecting the groups of a match:
import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):
    print([mat.group(1) for mat in re.finditer('(.+?), *', line)])
gives the same result and can be extended by writing other expressions in place of mat.group(1).
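For instance (a made-up variation), the same loop can collect both the captured text and where it starts:

import re

line = 'some string, another string, '
print([(mat.group(1), mat.start(1)) for mat in re.finditer('(.+?), *', line)])
# [('some string', 0), ('another string', 13)]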
In order to sum this up, it seems I am already using the best solution by constructing the regex pattern in a 'dynamic' manner:
>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')
The 2 * '(.*?), ' part is what I mean by dynamic. The alternative approach
>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)
fails to return the desired result due to the fact that (as Glenn and Alan kindly explained)
with match, the captured content gets overwritten with each repetition of the capturing group
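For completeness, a sketch of how the dynamic construction generalizes to an arbitrary field count n (the value of n here is made up; in the real project it would come from the file format):

from re import match

n = 3
pattern = n * '(.*?), '  # one capture group per expected field
print(match(pattern, 'some string, another string, third string, ').groups())
# ('some string', 'another string', 'third string')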
Thanks for your help everyone! :-)