Match Text Within Parenthesis Multiple Times - python

Assume I have text like this:
<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>
I want to use a single regex to extract all of the text between the <li>/list tags using python.
regexp = <p>.+?(<li>.+?</li>).+?</p>
This only returns the first item in the list surrounded by the <li>/list tags:
<li>pizza</li>
Is there a way for me to grab all of the items between the <li>/list tags so my output would look like:
<li>pizza</li><li>burgers</li><li>fries</li>

This should work:
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
res = ''.join(re.findall('<li>[^<]*</li>', source))
# <li>pizza</li><li>burgers</li><li>fries</li>
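For context, here is a minimal sketch of why the pattern in the question returns only one item. The pattern spans the whole <p>...</p> block, so re.findall finds a single match, and with one capturing group it reports only that group. The variable names are illustrative.

```python
import re

source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'

# The question's pattern matches the entire <p>...</p> once, so its single
# capturing group can only hold one <li> element.
whole_paragraph = re.findall(r'<p>.+?(<li>.+?</li>).+?</p>', source)

# Matching one <li> element at a time lets findall return every occurrence.
each_item = re.findall(r'<li>[^<]*</li>', source)

print(whole_paragraph)  # ['<li>pizza</li>']
print(each_item)        # ['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']
```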

Assuming you have already extracted the example string you mention, you can do:
import re
s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
re.findall("<li>.+?</li>", s)
Output:
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']

Why do you need the <p> tags ?
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>'
m = re.findall('(<li>.+?</li>)',source)
print(m)
returns what you want.
Edit
If you only want text that is between <p> tags, you can do it in two steps:
import re
source = '<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p> and also <li>coke</li>'
ss = re.findall('<p>(.+?)</p>',source)
for s in ss:
    m = re.findall('(<li>.+?</li>)', s)
    print(m)

Try these regexes with re.findall().
To get the text: <li>([^<]*)</li>. To get the tags: <li>[^<]*</li>
>>> import re
>>> s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
>>> text=re.findall("<li>([^<]*)</li>", s)
>>> tag=re.findall("<li>[^<]*</li>", s)
>>> text
['pizza', 'burgers', 'fries']
>>> tag
['<li>pizza</li>', '<li>burgers</li>', '<li>fries</li>']
>>>
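If the markup gets more complicated than this example (attributes on the tags, for instance), a parser may be easier to maintain than a regex. The following is only a sketch using the standard library's html.parser under Python 3; the class name ListItemCollector is made up for illustration, and nested <li> elements are not handled.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text inside each <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside_li = False

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._inside_li = True
            self.items.append('')  # start a new, empty item

    def handle_endtag(self, tag):
        if tag == 'li':
            self._inside_li = False

    def handle_data(self, data):
        # Text outside <li>...</li> (e.g. "Joe likes ") is ignored.
        if self._inside_li:
            self.items[-1] += data


s = "<p>Joe likes <ul><li>pizza</li>, <li>burgers</li>, and <li>fries</li></ul></p>"
collector = ListItemCollector()
collector.feed(s)
collector.close()
print(collector.items)  # ['pizza', 'burgers', 'fries']
```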

Related

Get strings between different tags in xml

text='<tag1>one</tag1>this should be displayed<tag2>two</tag2>this too<tag3>three</tag3>and this<tag4>four</tag4>'
Given the string above, and using Python,
I want to print:
this should be displayed
this too
and this
not
one,two,three,four
I tried this code:
import re
text='<>one</>this should be displayed<>two</>this too<>three</>and this<>four</>'
start=0
m=re.findall('>(.+?)<',text)
print(m)
but I am getting all the strings:
['one', 'this should be displayed', 'two', 'this too', 'three', 'and this', 'four']
You need to add a forward slash to the first part of the match. I would also use ([^<]+?) instead of (.+?), though the difference is probably cosmetic unless your input isn't correctly formatted.
m=re.findall('/>([^<]+?)<',text)
And you just changed your question, so here's a new answer to find text outside of tags:
m=re.findall('</.+?>([^<]+?)<.+?>',text)
You almost had it; you just need a /. Notice that you only want the text between /> and <, not between > and <:
Change this:
m=re.findall('>(.+?)<',text)
to this:
m=re.findall('/>(.+?)<',text)
Hence:
import re
text='<>one</>this should be displayed<>two</>this too<>three</>and this<>four</>'
print(re.findall('/>(.+?)<',text))
OUTPUT:
['this should be displayed', 'this too', 'and this']
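One caveat worth sketching: the '/>' pattern depends on the closing tags having no name, as in the '<>one</>' sample. With named tags such as </tag1>, the substring '/>' never occurs. The comparison below uses a pattern anchored on a complete closing tag; the variable names are illustrative.

```python
import re

bare = '<>one</>this should be displayed<>two</>this too<>three</>and this<>four</>'
named = ('<tag1>one</tag1>this should be displayed<tag2>two</tag2>'
         'this too<tag3>three</tag3>and this<tag4>four</tag4>')

# '/>' only appears when the closing tag is nameless, so the named-tag
# string produces no matches at all.
slash_only_on_named = re.findall(r'/>(.+?)<', named)

# Anchor on a whole closing tag instead: '</', any tag name, '>', then
# capture everything up to the next '<'.
after_closing_tag = r'</[^>]*>([^<]+)'
from_bare = re.findall(after_closing_tag, bare)
from_named = re.findall(after_closing_tag, named)

print(slash_only_on_named)  # []
print(from_bare)            # ['this should be displayed', 'this too', 'and this']
print(from_named)           # ['this should be displayed', 'this too', 'and this']
```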
EDIT:
Using BeautifulSoup:
from bs4 import BeautifulSoup
import bs4
text='<tag1>one</tag1>this should be displayed<tag2>two</tag2>this too<tag3>three</tag3>and this<tag4>four</tag4>'
soup = BeautifulSoup(text, 'html.parser')
for elem in soup:
    if type(elem) is bs4.element.NavigableString: # only if the elem is not of a tag type
        print(elem)
OUTPUT:
this should be displayed
this too
and this

How to join search patterns in CLiPS pattern.search

I do pattern matching in text using CLiPS pattern.search (Python 2.7).
I need to extract both phrases that correspond to 'VBN NP' and 'NP TO NP'.
I can do it separately and then join results:
from pattern.en import parse,parsetree
from pattern.search import search
text="Published case-control studies have a lot of information about susceptibility to asthma."
sentenceTree = parsetree(text, relations=True, lemmata=True)
matches = []
for match in search("VBN NP",sentenceTree):
    matches.append(match.string)
for match in search("NP TO NP",sentenceTree):
    matches.append(match.string)
print matches
# Output: [u'Published case-control studies', u'susceptibility to asthma']
But I want to join these into one search pattern. If I try this, I get no results at all.
matches = []
for match in search("VBN NP|NP TO NP",sentenceTree):
    matches.append(match.string)
print matches
#Output: []
The official documentation gives no clues for this. I also tried '{VBN NP}|{NP TO NP}' and '[VBN NP]|[NP TO NP]', but without any luck.
The question is:
Is it possible to join search patterns in CLiPS pattern.search?
And if the answer is "yes", how do I do it?
This pattern worked for me: {VBN NP} *+ {NP TO NP}, along with the match() and group() methods:
>>> from pattern.search import match
>>> from pattern.en import parsetree
>>> t = parsetree('Published case-control studies have a lot of information about susceptibility to asthma.',relations= True)
>>> m = match('{VBN NP} *+ {NP TO NP}',t)
>>> m.group(0) #matches the complete pattern
[Word(u'Published/VBN'), Word(u'case-control/NN'), Word(u'studies/NNS'), Word(u'have/VBP'), Word(u'a/DT'), Word(u'lot/NN'), Word(u'of/IN'), Word(u'information/NN'), Word(u'about/IN'), Word(u'susceptibility/NN'), Word(u'to/TO'), Word(u'asthma/NN')]
>>> m.group(1) # matches the first group
[Word(u'Published/VBN'), Word(u'case-control/NN')]
>>> m.group(2) # matches the second group
[Word(u'susceptibility/NN'), Word(u'to/TO'), Word(u'asthma/NN')]
Finally, you can display the result as:
>>> matches=[]
>>> for i in range(2):
...     matches.append(m.group(i+1).string)
...
>>> matches
[u'Published case-control', u'susceptibility to asthma']

Using python regex to exclude '.' at the end but not inside a string

I am trying to use a Python regex to spot #mentions such as #user and #user.name
So far I have:
htmlcontent = re.sub(r'((\#)([\w\.-]+))', r"<a href='/users/\3'>\1</a>", htmlcontent)
When this code spots a #mention followed by a . it does not exclude the dot:
e.g. Hi #user.name. How are you?
Output so far:
<a href='/users/user.name.'>#user.name.</a>
Desired output:
<a href='/users/user.name'>#user.name</a> <-- without . after name
Try this:
re.sub(r'((\#)([\w.-]+[\w]+))', r"<a href='/users/\3'>\1</a>", htmlcontent)
This tells the regex engine that '.' and '-' can appear in the middle of the name, but the match must end on a word character.
Running it on your example:
In [3]: htmlcontent = 'Hi #user.name. How are you?'
In [4]: re.sub(r'((\#)([\w.-]+[\w]+))', r"<a href='/users/\3'>\1</a>", htmlcontent)
Out[4]: "Hi <a href='/users/user.name'>#user.name</a>. How are you?"
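One edge case with the pattern above: [\w.-]+ followed by [\w]+ needs at least two characters, so a one-letter mention such as #a is not linked. A possible variant, offered only as a sketch, is [\w.-]*\w, which still has to end on a word character. The function name linkify_mentions is made up for illustration.

```python
import re

# Group 1 is the whole mention including '#', group 2 is the bare user name.
# [\w.-]*\w allows '.' and '-' inside the name but forces the last
# character to be a word character, even for one-letter names.
MENTION = re.compile(r'(#([\w.-]*\w))')


def linkify_mentions(text):
    """Wrap each #mention in a link (hypothetical helper)."""
    return MENTION.sub(r"<a href='/users/\2'>\1</a>", text)


print(linkify_mentions('Hi #user.name. How are you?'))
# Hi <a href='/users/user.name'>#user.name</a>. How are you?
print(linkify_mentions('ping #a now'))
# ping <a href='/users/a'>#a</a> now
```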
You could use a positive lookahead for the . at the end of the match, like
([\w\.-]+)(?=\.\s)?
Example
string = "Hi #user.name. How are you?"
print(re.sub(r'#([\w\.-]+)(?=\.\s)?', r"<a href='/users/\1'>\1</a>", string))
#Output
#Hi <a href='/users/user.name.'>user.name.</a> How are you?
string = "Hi #user.name How are you?"
print(re.sub(r'#([\w\.-]+)(?=\.\s)?', r"<a href='/users/\1'>\1</a>", string))
#Output
#Hi <a href='/users/user.name'>user.name</a> How are you?
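Note that the first output above still contains the trailing dot. The character class includes '.', and the greedy + consumes the final dot before any lookahead is tried. A lazy quantifier combined with a mandatory lookahead is one way to make this idea work. This is a sketch rather than the answerer's pattern, and link_mentions is a made-up name.

```python
import re

s = "Hi #user.name. How are you?"

# Greedy: the class contains '.', so the trailing dot is swallowed.
greedy = re.findall(r'#([\w.-]+)', s)

# Lazy: stop as soon as what follows is optional dots and then either a
# character that cannot be part of a name or the end of the string.
LAZY_MENTION = re.compile(r'#([\w.-]+?)(?=\.*(?:[^\w.-]|$))')


def link_mentions(text):
    """Wrap each #mention in a link (hypothetical helper)."""
    return LAZY_MENTION.sub(r"<a href='/users/\1'>#\1</a>", text)


print(greedy)  # ['user.name.']
print(link_mentions(s))
# Hi <a href='/users/user.name'>#user.name</a>. How are you?
print(link_mentions("Hi #user.name How are you?"))
# Hi <a href='/users/user.name'>#user.name</a> How are you?
```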

the right regex expression in python

I have a small problem extracting the words which are in bold (here Médoc, Margaux and Pessac-Léognan):
Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc
To clarify my question:
I'm trying to extract some information from web pages. Each time I find a line like the ones above, but I'm only interested in the part that is in bold. Here are the addresses of the web pages:
(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')
Any ideas?
You can use a positive lookahead to check whether Rouge, Blanc or Rosé comes after the word we are looking for:
>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
...
Médoc
Margaux
Pessac-Léognan
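The session above is Python 2 (print statement, ur'' prefix). Under Python 3 the ur prefix is a syntax error, and \w already matches accented letters in str patterns, so the same lookahead could be sketched as follows. The function name appellation and the constant COLOURS are made up for illustration.

```python
import re

COLOURS = ('Rouge', 'Blanc', 'Rosé')

# A run of word characters or hyphens that is followed, after some
# non-word characters such as ', ', by one of the colour names.
PATTERN = re.compile(r'[\w-]+(?=\W+(?:%s))' % '|'.join(COLOURS))


def appellation(label):
    """Return the term just before the colour, or None if there is no match."""
    m = PATTERN.search(label)
    return m.group(0) if m else None


labels = ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]
print([appellation(label) for label in labels])
# ['Médoc', 'Margaux', 'Pessac-Léognan']
```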
It seems to always be the second-to-last term in the comma-separated list. You can split and select the second-to-last item, for example:
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]
Otherwise, if you want a regex alone, I'd suggest this:
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)
Then strip surrounding whitespace if necessary (here the captured group starts with a space).
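A small sketch of the split approach that does not depend on the exact ', ' separator: split on the comma alone and strip each part. The function name second_to_last is illustrative, and returning None for labels with fewer than two terms is an assumption about what the caller wants.

```python
def second_to_last(label):
    """Return the second-to-last comma-separated term, stripped of spaces."""
    parts = [part.strip() for part in label.split(',')]
    # A label with a single term has no second-to-last item.
    return parts[-2] if len(parts) >= 2 else None


print(second_to_last('2ème vin, Pessac-Léognan, Blanc'))  # Pessac-Léognan
print(second_to_last('Médoc,Rouge'))                      # Médoc
```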

regex or statement in any order

I need help with a Python regular expression. I have a string that contains keywords, but sometimes the keywords don't exist, and they are not in any particular order.
Keywords are:
Up-to-date
date added
date trained
These are the keywords I need to find among a number of other keywords; they may not exist and can appear in any order.
What the string looks like:
<div>
<h2 class='someClass'>text</h2>
blah blah blah Up-to-date blah date added blah
</div>
What I've tried:
regex = re.compile('</h2>.*(Up\-to\-date|date\sadded|date\strained)*.*</div>')
regex = re.compile('</h2>.*(Up\-to\-date?)|(date\sadded?)|(date\strained?).*</div>')
re.findall(regex,string)
The outcome I'm looking for would be:
If all exist
['Up-to-date','date added','date trained']
If some exist
['Up-to-date','','date trained']
Does it have to be a regex? If not, you could use find:
In [12]: sentence = 'hello world cat dog'
In [13]: words = ['cat', 'bear', 'dog']
In [15]: [w*(sentence.find(w)>=0) for w in words]
Out[15]: ['cat', '', 'dog']
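The w*(...) trick works because multiplying a string by True gives the string and multiplying by False gives ''. Applied to the string from the question, the same idea can be sketched with the in operator, which some may find easier to read. The variable names are illustrative.

```python
keywords = ['Up-to-date', 'date added', 'date trained']
text = ("<div>\n<h2 class='someClass'>text</h2>\n"
        "blah blah blah Up-to-date blah date added blah\n</div>")

# Keep the keyword when it occurs in the text, otherwise an empty string.
with_in = [kw if kw in text else '' for kw in keywords]

# Equivalent to the str * bool form used in the answer above.
with_find = [kw * (text.find(kw) >= 0) for kw in keywords]

print(with_in)    # ['Up-to-date', 'date added', '']
print(with_find)  # ['Up-to-date', 'date added', '']
```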
This code does what you want, but it smells:
import re
def check(the_str):
    output_list = []
    # Raw strings avoid invalid-escape warnings; each pattern looks for one keyword.
    u2d = re.compile(r'</h2>.*Up-to-date.*</div>')
    da = re.compile(r'</h2>.*date\sadded.*</div>')
    dt = re.compile(r'</h2>.*date\strained.*</div>')
    if re.match(u2d, the_str):
        output_list.append("Up-to-date")
    if re.match(da, the_str):
        output_list.append("date added")
    if re.match(dt, the_str):
        output_list.append("date trained")
    return output_list
the_str = "</h2>My super cool string with the date added and then some more text</div>"
print(check(the_str))
the_str2 = "</h2>My super cool string date added with the date trained and then some more text</div>"
print(check(the_str2))
the_str3 = "</h2>My super cool string date added with the date trained and then Up-to-date some more text</div>"
print(check(the_str3))
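The function above returns only the keywords it finds, and re.match requires the string to start with </h2>. A possible single-pass variant, sketched under the assumption that the keywords sit between </h2> and </div>, uses one alternation and puts '' in the slot of each missing keyword, as the question asks. The names KEYWORDS and find_keywords are made up for illustration.

```python
import re

KEYWORDS = ['Up-to-date', 'date added', 'date trained']


def find_keywords(html):
    """Return one slot per keyword: the keyword if present, else ''."""
    # Limit the search to the text between </h2> and </div>;
    # re.DOTALL lets '.' match the newlines inside the <div>.
    block = re.search(r'</h2>(.*?)</div>', html, re.DOTALL)
    if block is None:
        return [''] * len(KEYWORDS)
    # One alternation finds every keyword in a single scan, in any order.
    alternation = '|'.join(re.escape(kw) for kw in KEYWORDS)
    found = set(re.findall(alternation, block.group(1)))
    return [kw if kw in found else '' for kw in KEYWORDS]


html = ("<div>\n<h2 class='someClass'>text</h2>\n"
        "blah blah blah Up-to-date blah date added blah\n</div>")
print(find_keywords(html))  # ['Up-to-date', 'date added', '']
```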
