Cannot get image path using regex - python

I'd like to extract image path from text like this:
body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
OR
body = '<br> Hi <img src="/path/to/15004/other_Random_name.png" class="img-responsive" /> other text'
My regexp:
match = re.search(r'src=\"(?P<path1>\"', body)
if match:
    print(match.group('path1'))
else:
    print("no match found")
But it does not capture any path. How can I fix this?

For a quick and dirty hack, you could use
<img[^>]*src="([^"]+)
The golden path would be to use a parser though. See a demo on regex101.com.
In Python this could be
import re
junk = """body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
body = '<br> Hi <img src="/path/to/15004/other_Random_name.png" class="img-responsive" /> other text'"""
rx = re.compile(r'<img[^>]*src="([^"]+)')
sources = rx.findall(junk)
print(sources)
Which yields
['/path/to/1234/some_Random_name24.jpg', '/path/to/15004/other_Random_name.png']
See another demo on ideone.com.
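As a rough illustration of the parser route mentioned above, here is a minimal sketch that uses only the standard library's html.parser. The names ImgSrcCollector and extract_img_sources are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html_text):
    """Hypothetical helper: return all img src values found in html_text."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
print(extract_img_sources(body))  # ['/path/to/1234/some_Random_name24.jpg']
```

Unlike the regex, this does not depend on the attribute order or on the quoting style used inside the img tag.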


How to keep all html elements with selector but drop all others?

I would like to get a HTML string without certain elements. However, upfront I just know which elements to keep but don't know which ones to drop.
Let's say I just want to keep all p and a tags inside the div with class="A".
Input:
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
</div>
<div class="B">
ContentDiv2
</div>
Expected output:
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
If I knew the selectors of all the other elements, I could just use lxml's drop_tree(). But the problem is that I don't know ['img', 'div.sub1', 'div.B'] upfront.
Example with drop_tree():
import lxml.cssselect
import lxml.html
tree = lxml.html.fromstring(html_str)
elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()
output = lxml.html.tostring(tree)
I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
target = tree.xpath('//div[@class="A"]')[0]
to_keep = target.xpath('.//p | .//a')  # relative paths: only look inside the target div
for t in target.xpath('.//*'):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())
The output I get is your expected output.
Try the below. The idea is to clean the root and add the required sub elements.
Note that no external lib is required.
import xml.etree.ElementTree as ET
html = '''<div class="A">
<p>Text1</p>
<img src="A.jpg"/>
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
link text
ContentDiv2
</div>'''
root = ET.fromstring(html)
p_lst = root.findall('./p')
a_lst = root.findall('./a')
children = list(root)
for c in children:
    root.remove(c)
for p in p_lst:
    p.tail = ''
    root.append(p)
for a in a_lst:
    a.tail = ''
    root.append(a)
root.text = ''
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<div class="A">
<p>Text1</p>
<p>Text2</p>
link text
</div>
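A variation on the same idea, as a sketch only: instead of emptying the root and re-appending, remove the direct children whose tag is not on a keep-list, which also preserves the original order. The helper name keep_only and the href="#" value are assumptions made for this example, and the input has to be well-formed XML for ElementTree to parse it.

```python
import xml.etree.ElementTree as ET


def keep_only(root, keep_tags):
    """Hypothetical helper: drop direct children whose tag is not in keep_tags."""
    for child in list(root):  # iterate over a copy, since root is mutated
        if child.tag not in keep_tags:
            root.remove(child)  # the child's tail text goes away with it
    return root


# href="#" is a placeholder; the link target is not given in the question.
html = (
    '<div class="A">'
    '<p>Text1</p>'
    '<img src="A.jpg"/>'
    '<div class="sub1"><p>Subtext1</p></div>'
    '<p>Text2</p>'
    '<a href="#">link text</a>'
    '</div>'
)
root = keep_only(ET.fromstring(html), {"p", "a"})
print(ET.tostring(root, encoding="unicode"))
```

Only direct children are inspected here, so the p inside div.sub1 disappears together with its parent, which matches the expected output in the question.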

Python: Regular expression to extract text between any two tags in a html

I tried using "<.+>\s*(.*?)\s*<\/?.+>" on an HTML file. The following is the Python code I used:
import re
def recursiveExtractor(content):
    re1 = r'(<.+>\s*(.+?)\s*<\/?.+>)'
    m = re.findall(re1, content)
    if m:
        for (id, item) in enumerate(m):
            text = m[id][1]
            if text: print(text, "\n")
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
recursiveExtractor(f)
But it skips some text since HTML is nested and regex restarts search from the end of the matched part.
For the above input,
the output is
<div class='b'>
<div class='d'>text2</div>
</div>
But the expected Output is:
text1
text2
Edit:
I read that HTML is not a regular language and hence can't be parsed with regular expressions. From what I understand, it is not possible to parse .* (i.e. with the same closing tags).
But what I need is the text between any tags, for instance text1 text2 text3. So I am fine with a list of "text1","text2","text3".
Why not just do this:
import re
f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
x = re.sub('<[^>]*>', '', f)  # you can also use re.sub(r'<[A-Za-z\/][^>]*>', '', f)
print('\n'.join(x.split()))
This will have the following output:
text1
text2
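Note that x.split() also breaks multi-word text into separate words. If whole text nodes are wanted, one possible alternative is a small html.parser subclass, sketched below; TextCollector and extract_texts are names invented for this example, and the second text node was lengthened to show that spaces survive.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-blank text node, whatever tag surrounds it."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:  # skip whitespace-only nodes between tags
            self.texts.append(stripped)


def extract_texts(html_text):
    """Hypothetical helper: return the list of text nodes in html_text."""
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.texts


f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2 with spaces</div>
</div>
</div>
</div>
"""
print(extract_texts(f))  # ['text1', 'text2 with spaces']
```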

PyQuery get text node

I'm using PyQuery to process this HTML:
<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>
Now that I've got a variable e pointing to .container, I'm looping through its children:
for c in e.iterchildren():
    print(c.tag)
but this way I can't get the text nodes (the two Text strings).
How can I loop over an element's children, including text nodes?
You can do it like this:
for c in e.children():
    p = PyQuery(c)
    print(str(p))
    # here, re.sub can remove the HTML tags
This code gets the raw text of each node.
If you want to distinguish text nodes from the others:
raw = str(p).strip()
a = raw.rfind(">")
if a + 1 != len(raw):  # something follows the last '>', so there is trailing text
    print('is text')
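Another way to look at it: PyQuery wraps lxml elements, and in the ElementTree data model that lxml follows, text that comes after a child element is stored on that child as .tail. The sketch below uses the standard library's xml.etree.ElementTree rather than PyQuery, so it assumes the markup is well-formed XML; iter_nodes is a name made up for this example.

```python
import xml.etree.ElementTree as ET

html = """<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>"""


def iter_nodes(parent):
    """Yield ('text', string) and ('tag', name) pairs in document order."""
    if parent.text and parent.text.strip():
        yield ("text", parent.text.strip())
    for child in parent:
        yield ("tag", child.tag)
        # Text following a child element lives on that child as .tail.
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())


container = ET.fromstring(html)
for kind, value in iter_nodes(container):
    print(kind, value)
```

With the HTML from the question, the two Text strings show up as the tails of the br elements that precede them.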

Getting multiple blocks of text between tags

This is my HTML:
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
The number of blocks of text is variable, Header1 is consistent, Header2 is not.
I'm successfully extracting the first block of text using the following code:
def get_summary (soup):
    raw = soup.find('div',{"class":"left_panel"})
    for h4 in raw.findAllNext('h4'):
        following = h4.nextSibling
        return following
However, I need all of the items sitting between the two h4 tags. I was hoping that h4.nextSiblings would solve this, but for some reason it raises the following error:
TypeError: 'NoneType' object is not callable
I've been trying variations on this answer: Find next siblings until a certain one using beautifulsoup but the absence of a leading tag is confusing me.
Find the first header and iterate over .next_siblings until you hit another header:
from bs4 import BeautifulSoup
data = """
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
header1 = soup.find('h4', text='Header1')
for item in header1.next_siblings:
    if getattr(item, 'name', None) == 'h4' and item.text == 'Header2':
        break
    print(item)
Update (collecting texts between two h4 tags):
texts = []
for item in header1.next_siblings:
    if getattr(item, 'name', None) == 'h4' and item.text == 'Header2':
        break
    try:
        texts.append(item.text)
    except AttributeError:
        texts.append(item)
print(''.join(texts))
I don't understand why you are passing soup as an argument but don't use it.
If you use the correct soup instance you shouldn't get that error. findAllNext('h4') returns <h4>Header1</h4> and <h4>Header2</h4>; applying nextSibling to each returns its text sibling, which are
block of text that I want.
and
')
in your case.
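If BeautifulSoup is not available, the "collect everything until the next header" idea can also be sketched with the standard library's html.parser. This is an alternative technique, not what the answers above use, and the class name BetweenHeaders is invented for this example.

```python
from html.parser import HTMLParser


class BetweenHeaders(HTMLParser):
    """Collect text that follows a given <h4> up to the next <h4>."""

    def __init__(self, start_header):
        super().__init__()
        self.start_header = start_header
        self.in_h4 = False       # currently inside an <h4> element
        self.collecting = False  # past the start header, before the next <h4>
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.in_h4 = True
            self.collecting = False  # any new header ends the collection

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_h4:
            # Start collecting only after the header we were asked for.
            self.collecting = (text == self.start_header)
        elif self.collecting and text:
            self.blocks.append(text)


data = """
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
</div>
"""
parser = BetweenHeaders("Header1")
parser.feed(data)
parser.close()
print(parser.blocks)
```

Because any h4 stops the collection, this does not need to know the text of Header2, which the question says is not consistent.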

How to get text and replace text between certain tags

Given a string like
"<p> >this line starts with an arrow <br /> this line does not </p>"
or
"<p> >this line starts with an arrow </p> <p> this line does not </p>"
How can I find the lines that start with an arrow and surround them with a div?
So that it becomes:
"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"
Since it is HTML you are parsing, use the right tool for the job: an HTML parser, like BeautifulSoup.
Use find_all() to find all text nodes that start with > and wrap() them with a new div tag:
from bs4 import BeautifulSoup
data = "<p> >this line starts with an arrow <br /> this line does not </p>"
soup = BeautifulSoup(data, 'html.parser')
for item in soup.find_all(text=lambda x: x.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))
print(soup.prettify())
Prints:
<p>
<div>
>this line starts with an arrow
</div>
<br/>
this line does not
</p>
You can try the >\s+(>.*?)< regex pattern.
import re
regex = re.compile(r">\s+(>.*?)<")
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches
Then replace the matched group with <div> matched_group </div>. The pattern looks for anything enclosed between > > and <.
Here is a demo on Debuggex.
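The replacement step itself could look roughly like the sketch below. It uses the same pattern with the surrounding delimiters captured so they can be put back; ARROW_TEXT and wrap_arrow_lines are names chosen for this example only, and the usual caveat applies that a regex can be fooled by markup it did not anticipate.

```python
import re

# Same pattern as above, with the delimiters captured so they can be restored.
ARROW_TEXT = re.compile(r"(>\s+)(>.*?)(<)")


def wrap_arrow_lines(html_text):
    """Hypothetical helper: wrap text starting with '>' in <div>...</div>."""
    return ARROW_TEXT.sub(r"\1<div> \2</div> \3", html_text)


test_string = "<p> >this line starts with an arrow <br /> this line does not </p>"
print(wrap_arrow_lines(test_string))
# <p> <div> >this line starts with an arrow </div> <br /> this line does not </p>
```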
You could try this regex,
>(\w[^<]*)
DEMO
Python code would be,
>>> import re
>>> s = '"<p> >this line starts with an arrow <br /> this line does not </p>"'
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", s)
>>> m
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'
