Requirement: using regex, I want to fetch only specific strings from an input list, i.e. the text enclosed between "-" symbols or between "*" symbols. Below is the code snippet:
ZTon = ['one-- and preferably only one --obvious', " Hello World", 'Now is better than never.', 'Although never is often better than *right* now.']
ZTon = [ line.strip() for line in ZTon]
print (ZTon)
r = re.compile(".^--")
portion = list(filter(r.match, ZTon)) # Read Note
print (portion)
Expected response:
['and preferably only one','right']
Using regex
import re
ZTon = ['one-- and preferably only one --obvious', " Hello World", 'Now is better than never.', 'Although never is often better than *right* now.']
pattern = r'(--|\*)(.*)\1'
l = []
for line in ZTon:
    s = re.search(pattern, line)
    if s:
        l.append(s.group(2).strip())
print(l)
# ['and preferably only one', 'right']
import re
ZTon = ['one-- and preferably only one --obvious', " Hello World", 'Now is better than never.', 'Although never is often better than *right* now.']
def gen(lst):
    for s in lst:
        s = ''.join(i.strip() for g in re.findall(r'(?:-([^-]+)-)|(?:\*([^*]+)\*)', s) for i in g)
        if s:
            yield s
print(list(gen(ZTon)))
Prints:
['and preferably only one', 'right']
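Both answers above return at most one string per line. If a line could hold several delimited pieces, a non-greedy pattern with re.finditer might be an option. This is only a sketch: the name extract and the choice to trim whitespace inside the pattern are my own assumptions, not something from the question.

```python
import re

ZTon = ['one-- and preferably only one --obvious', " Hello World",
        'Now is better than never.',
        'Although never is often better than *right* now.']

# (--|\*) captures the opening delimiter, \1 requires the same one to close it.
# .+? is non-greedy, so each match stops at the first closing delimiter.
pattern = re.compile(r'(--|\*)\s*(.+?)\s*\1')

def extract(lines):
    # hypothetical helper: collect every delimited piece from every line
    return [m.group(2) for line in lines for m in pattern.finditer(line)]

result = extract(ZTon)
print(result)
```

With the sample list this should give the same two strings as the answers above.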
Related
I'm looking for a package or any other approach (other than manual replacement) for the templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate. Probably used wrong naming terms.
If there isnt any ready to use package around I would love to read some tips on the starting point to code this myself.
I think using a list is sufficient, since Python lists keep their elements in order:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
output:
I love science
Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1],item=item[2])
print(text) # I love science
or even this is possible.
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I love science
Hope this helps!
Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or if you wanted a method to rid yourself of having to reformat to accommodate/remove spaces, then incorporate this into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."
Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double braces, because single ones are reserved for string formatting.
I ended up with the following class:
import re

class ArgTempl:
    def __init__(self, _str):
        self._str = _str

    def format(self, **args):
        result = self._str  # work on a copy so the template can be reused
        for k in re.finditer(r"{{(\w+):([\w,]+?)}}", self._str,
                             flags=re.DOTALL | re.MULTILINE | re.IGNORECASE):
            key, replacements = k.groups()
            if key not in args:
                continue
            # the value passed for each key is an index into the comma-separated options
            result = result.replace(k.group(0), replacements.split(',')[args[key]])
        return result
This is primitive code written in 5 minutes, hence the lack of checks and so on. It works as expected and can be improved easily.
Tested on Python 2.7 & 3.6~
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
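The same idea can probably be expressed with re.sub and a replacement function, which avoids the separate str.replace pass. This is a rough sketch, not a drop-in replacement for the class: render, pick and _FIELD are names I made up, and leaving unknown keys untouched is an assumption carried over from the class above.

```python
import re

# Braces are escaped explicitly so the pattern does not rely on how
# the regex engine treats a bare "{".
_FIELD = re.compile(r"\{\{(\w+):([\w,]+?)\}\}")

def render(template, **choices):
    def pick(match):
        key, options = match.group(1), match.group(2).split(',')
        if key not in choices:
            return match.group(0)      # unknown key: leave the field as written
        return options[choices[key]]   # the value is an index into the options
    return _FIELD.sub(pick, template)

text = render("I {{what:like,love}} {{item:pizza,space,science}}", what=1, item=2)
print(text)
```

An out-of-range index still raises IndexError here, just as it would in the class.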
Thanks for all of the replies.
I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain anything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card').
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following Regex (where 'incident1' is the string being parsed), which I believed would return an unlimited number of all preceding Regexes, however all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?
You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
    'Man of the Match',
    'Goal',
    'Assist',
    'Yellow Card',
    'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print(matches) # ['Man of the Match', 'Red Card', 'Red Card']
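On the original attempt: in "Goal*" the star applies only to the final l, so the pattern means "Goa" followed by zero or more "l" characters, not "Goal repeated". If what you need is how many times each incident occurred, feeding the findall result into collections.Counter might do it. A small sketch, where count_incidents and the sample string are my own invention:

```python
import re
from collections import Counter

PHRASES = ['Man of the Match', 'Goal', 'Assist', 'Yellow Card', 'Red Card']
# re.escape is not strictly needed for these phrases, but keeps the
# pattern safe if a phrase ever contains a regex metacharacter.
pattern = re.compile('|'.join(map(re.escape, PHRASES)))

def count_incidents(text):
    # hypothetical helper: the question says the input may be None
    if not text:
        return Counter()
    return Counter(pattern.findall(text))

counts = count_incidents('Man of the Match Goal Goal Assist Yellow Card')
print(counts['Goal'], counts['Red Card'])
```

A Counter returns 0 for phrases that never appear, so there is no need to check for None on each one.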
Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see this isn't unusual. These two functions are used to identify if a line contains a certain regex. You'd use these for something like:
import re, subprocess
# calls ipconfig and sends its output to s as text rather than bytes
s = subprocess.check_output('ipconfig', universal_newlines=True)
for line in s.splitlines():
    if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line):
        # if line contains an IP address...
        print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
         'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
# Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note: a case-insensitive re.search for "age: 16" would have found both lines,
# because the first one contains "page: 16"!
The takeaway is that you use these two functions to select lines in a longer document (or any iterable).
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567', '(555)987-6543', '(555)135-7924']
re.finditer returns an iterator of match objects instead of a list of strings. You'd use this the same way you'd use xrange instead of range in Python 2. re.findall(some_pattern, some_string) can build a GIANT list if there are a TON of matches; re.finditer produces them one at a time.
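To make the difference concrete, here is a small sketch using the phone book from above. The helper first_number is hypothetical; the point is only that finditer hands back match objects lazily, so you can stop early and also ask each match where it was found.

```python
import re

phone_book = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = re.compile(r'(?:\(\d{3}\))?\d{3}-?\d{4}')

def first_number(text):
    # next() pulls only the first match; the rest are never computed
    match = next(pat.finditer(text), None)
    return match.group(0) if match else None

first = first_number(phone_book)

# each item is a match object, so positions are available as well
spans = [(m.group(0), m.span()) for m in pat.finditer(phone_book)]
print(first, len(spans))
```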
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
valid_dates = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}          # one or two digits
    (.)              # followed by a separator (capture)
    \d{1,2}          # one or two more digits
    \1               # a backreference to that separator
    \d{2}(?:\d{2})?  # two digits, optionally followed by two more''', re.X)
for date in dates:
    match = match_pat.match(date)
    if match:
        separators.append(match.group(1))  # the separator
        valid_dates.append(date)
    # else: this isn't really a date, is it? Skip it rather than
    # popping from the list we are iterating over.
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(valid_dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
A slightly less advanced example, you can use re.sub with any sort of formatted expression and pass it a function to return! For instance:
def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'): # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
for line in file.splitlines():
    the_split_line = line.split(',')
    dept = the_split_line[-1]
    if dept.isdigit():  # leave the header row untouched
        dept = departments.get(dept, "Unknown Dept")
    # wrap dept in a list: a list cannot be concatenated with a bare string
    replaced_lines.append(','.join(the_split_line[:-1] + [dept]))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
# the lambda receives a match object, so look up the matched text via group(0)
re.sub(r'''\d+$''', lambda x: departments.get(x.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!
I'm involved in a web project. I have to choose the best ways to represent the code, so that other people can read it without problems/headaches/whatever.
The "problem" I've tackled now is to show a nice formatted url (will be taken from a "title" string).
So, let's suppose we have a title, fetched from the form:
title = request.form['title'] # 'Hello World, Hello Cat! Hello?'
then we need a function to format it for inclusion in the url (it needs to become 'hello_world_hello_cat_hello'), so for the moment I'm using this one which I think sucks for readability:
title.replace(' ', '_').replace('!', '').replace('?', '').replace(',', '').lower()
What would be a good way to compact it? Is there already a function for doing what I'm doing?
I'd also like to know which characters/symbols I should strip from the url.
You can use urlencode(), which is the standard way to URL-encode strings in Python.
If instead you want a custom encoding like your expected output, and all you need is to keep the words in the final string, you can use the re.findall function to grab them and then join them with an underscore:
>>> s = 'Hello World, Hello Cat! Hello?'
>>> '_'.join(re.findall(r'\w+',s)).lower()
'hello_world_hello_cat_hello'
What this does is:
g = re.findall(r'\w+',s) # ['Hello', 'World', 'Hello', 'Cat', 'Hello']
s1 = '_'.join(g) # 'Hello_World_Hello_Cat_Hello'
s1.lower() # 'hello_world_hello_cat_hello'
This technique also works well with numbers in the string:
>>> s = 'Hello World, Hello Cat! H123ello? 123'
>>> '_'.join(re.findall(r'\w+',s)).lower()
'hello_world_hello_cat_h123ello_123'
Another way, which I thought would be faster, is to replace the non-alphanumeric characters directly. This can be done with re.sub by matching each run of non-alphanumeric characters and replacing it with _, like this:
>>> re.sub(r'\W+','_',s).lower()
'hello_world_hello_cat_h123ello_123'
Well... not really, speed tests:
$python -mtimeit -s "import re" -s "s='Hello World, Hello Cat! Hello?'" "'_'.join(re.findall(r'\w+',s)).lower()"
100000 loops, best of 3: 5.08 usec per loop
$python -mtimeit -s "import re" -s "s='Hello World, Hello Cat! Hello?'" "re.sub(r'\W+','_',s).lower()"
100000 loops, best of 3: 6.55 usec per loop
You could use urlencode() from the urllib module in python2 or urllib.parse module in python3.
This will work assuming you're trying to use the text in the query string of your URL.
title = {'title': 'Hello World, Hello Cat! Hello?'} # or get it programmatically as you did
encoded = urllib.urlencode(title)
print encoded # title=Hello+World%2C+Hello+Cat%21+Hello%3F
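The snippet above is Python 2. For Python 3, the equivalent would be roughly the following, with urlencode imported from urllib.parse as mentioned; the variable names are just for illustration.

```python
from urllib.parse import urlencode

# urlencode takes a mapping (or a sequence of pairs) and builds a query string
title = {'title': 'Hello World, Hello Cat! Hello?'}
encoded = urlencode(title)
print(encoded)  # title=Hello+World%2C+Hello+Cat%21+Hello%3F
```

Note that this produces a query string (key=value), not a path segment, so it fits ?title=... URLs rather than pretty /hello_world/ style ones.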
So I've been playing with the solutions from all your answers, and here's what I've come up with.
Note: these "benchmarks" are not to be taken too seriously, as I didn't go through all the possible approaches, but they give a quick, broad view.
re.findall()
def findall():
    string = 'Hello World, Hello Cat! Hello?'
    return '_'.join(re.findall(r'\w+',string)).lower()
real=0.019s, user=0.012s, sys=0.004s, rough=0.016s
re.sub()
def sub():
    string = 'Hello World, Hello Cat! Hello?'
    return re.sub(r'\W+','_',string).lower()
real=0.020s, user=0.016s, sys=0.004s, rough=0.020s
slugify()
def slug():
    string = 'Hello World, Hello Cat! Hello?'
    return slugify(string)
real=0.031s, user=0.024s, sys=0.004s, rough=0.028s
urllib.urlencode()
def urlenc():
    string = {'title': 'Hello World, Hello Cat! Hello?'}
    return urllib.urlencode(string)
real=0.036s, user=0.024s, sys=0.008s, rough=0.032s
As you can see, the fastest is re.findall(), the slowest is urllib.urlencode(), and in the middle there's slugify(), which is also the shortest/cleanest of them all (although not the fastest).
What I've chosen for now is Slugify, the lucky cat in between the bulldogs.
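For anyone who would rather not add a dependency, something along these lines may be close enough for simple titles. To be clear, this is not how any particular slugify package works; simple_slug is a hypothetical helper, and dropping accents through NFKD normalization is my own assumption about what is wanted.

```python
import re
import unicodedata

def simple_slug(title, sep='_'):
    # decompose accented characters, then drop whatever is not plain ASCII
    ascii_text = (unicodedata.normalize('NFKD', title)
                  .encode('ascii', 'ignore')
                  .decode('ascii'))
    # collapse every run of other characters into a single separator
    slug = re.sub(r'[^a-z0-9]+', sep, ascii_text.lower())
    return slug.strip(sep)  # no leading or trailing separator

slug = simple_slug('Hello World, Hello Cat! Hello?')
print(slug)
```

Keeping only a-z, 0-9 and the separator also sidesteps the question of which symbols to strip from the URL: everything else goes.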
import re
re.sub(r'!|\?|,', '', text)
This will remove ! ? and , from the string.
I mean you could split it up into multiple statements:
title = title.replace(' ', '_')
title = title.replace('!', '')
title = title.replace('?', '')
title = title.replace(',', '')
title = title.lower()
This will make for better readability.
Sure, you can do this:
import string
uppers = string.ascii_uppercase # ABC...Z
lowers = string.ascii_lowercase # abc...z
removals = ''.join([ch for ch in string.punctuation if ch != '_'])
transtable = str.maketrans(uppers+" ",lowers+"_",removals)
title = "Hello World, Hello Cat! Hello?"
title.translate(transtable)
You could also do a list comp and ''.join it.
whitelist = string.ascii_uppercase + string.ascii_lowercase + " "
newtitle = ''.join('_' if ch == ' ' else ch.lower() for ch in title if ch in
whitelist)
So what I am trying to do is have an input field named a, then have a line of regex which checks a for 'i am (something)' (note that something could be a chain of words) and then prints How long have you been (something)?
This is my code so far:
if re.findall(r"i am", a):
    print('How long have you been {}'.format(re.findall(r"i am", a)))
But this returns a list containing 'i am', not the (something). How do I get it to return the (something)?
Thanks,
A n00b at Python
Do you mean something like this?
>>> import re
>>> a = "I am a programmer"
>>> reg = re.compile(r'I am (.*?)$')
>>> print('How long have you been {}'.format(*reg.findall(a)))
How long have you been a programmer
r'I am (.*?)$' matches I am and then everything else to the end of the string.
To match one word after, you can do:
>>> a = "I am an apple"
>>> reg = re.compile(r'I am (\w+).*?$')
>>> print('How long have you been {}'.format(*reg.findall(a)))
How long have you been an
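Since the question uses lowercase "i am" and the phrase may not be at the start of the input, a variant with re.search and re.IGNORECASE might be closer to what is needed. Just a sketch; reply and _I_AM are made-up names, and returning None when nothing matches is my assumption.

```python
import re

# \b keeps "i am" from matching inside a longer word;
# (.+) captures everything after it up to the end of the line
_I_AM = re.compile(r'\bi am (.+)', re.IGNORECASE)

def reply(a):
    match = _I_AM.search(a)
    if match:
        return 'How long have you been {}?'.format(match.group(1).strip())
    return None  # nothing to respond to

answer = reply("Well, i am tired of regex")
print(answer)
```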
Maybe just a simple solution that avoids the weight and cost of a regexp:
>>> a = "i am foxmask"
>>> print a[5:]
foxmask
resp = raw_input("What is your favorite fruit?\n")
if "I like" in resp:
    print "%s" - "I like" + " is a delicious fruit." % resp
else:
    print "Bubbles and beans."
OK I know this code doesn't work, and I know why. You can't subtract strings from each other like numbers.
But is there a way to break apart a string and only use part of the response?
And by "is there a way" I really mean "how," because anything is possible in programming. :D
I'm trying to write my first chatterbot from scratch.
One option would be to simply replace the part that you want to remove with an empty string:
resp = raw_input("What is your favorite fruit?\n")
if "I like" in resp:
    print "%s is a delicious fruit." % (resp.replace("I like ", ""))
else:
    print "Bubbles and beans."
If you want to look into more advanced pattern matching to grab out more specific parts of strings via flexible patterns, you might want to look into regular expressions.
# python
"I like turtles".replace("I like ","")
'turtles'
Here's a way to do it with regular expressions:
In [1]: import re
In [2]: pat = r'(I like )*(\w+)( is a delicious fruit)*'
In [3]: print re.match(pat, 'I like apples').groups()[1]
apples
In [4]: print re.match(pat, 'apple is a delicious fruit').groups()[1]
apple
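Can you just strip "I like"? Not reliably with str.strip: it treats its argument as a set of characters to remove from both ends, not as a prefix, so it can eat letters of the fruit itself.
resp.strip("I like") # "I like kiwi" becomes "w"
Removing the prefix with resp.replace("I like ", "", 1) is safer. Be careful of case sensitivity though.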
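To show the difference between stripping characters and removing a prefix, here is a short sketch. remove_prefix is a hypothetical helper written with startswith and slicing so it does not depend on the Python version.

```python
def remove_prefix(text, prefix):
    # cut the prefix only when the text really starts with it
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

resp = "I like kiwi"
stripped = resp.strip("I like")            # removes the characters I, space, l, i, k, e from both ends
fruit = remove_prefix(resp, "I like ")     # removes the exact leading phrase
print(repr(stripped), repr(fruit))
```

Here strip leaves only "w", because every other character of "kiwi" is also in "I like".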