I'm using PyQuery to process this HTML:
<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>
Now that I have a variable e pointing to .container, I'm looping through its children:
for c in e.iterchildren():
    print(c.tag)
but this way I can't get the text nodes (the two Text strings).
How can I loop over an element's children, including text nodes?
You can do it like this:
for c in e.children():
    p = PyQuery(c)
    print(str(p))
    # here, re.sub can remove the HTML tags
This code gets the raw text of each node.
If you want to distinguish a node that has trailing text from the others:
raw = str(p).strip()
a = raw.rfind(">")
if a + 1 != len(raw):
    print('is text')
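For what it's worth, PyQuery sits on top of lxml, where text that follows a child element is stored as that child's .tail rather than as a separate node. Below is a minimal sketch of walking elements and text together. It uses the standard library's xml.etree.ElementTree as a stand-in (it shares the text/tail model), and iter_nodes is a hypothetical helper name, not a PyQuery or lxml API.

```python
import xml.etree.ElementTree as ET

html = """<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>"""


def iter_nodes(parent):
    """Yield child elements and non-blank text pieces in document order."""
    # Text before the first child lives on the parent itself.
    if parent.text and parent.text.strip():
        yield parent.text.strip()
    for child in parent:
        yield child
        # Text after a child (up to the next sibling) lives in child.tail.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


container = ET.fromstring(html)
for node in iter_nodes(container):
    if isinstance(node, str):
        print("text:", node)
    else:
        print("tag:", node.tag)
```

Whitespace-only pieces are skipped here. Drop the strip() checks if the blank text between tags matters to you.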
I'm very new to Python and BeautifulSoup and trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> # random break tags inserted throughout
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul> # More HTML that I do not want</ul>
</div> # End container div
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_='container').find_next('h4', text='Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br />', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
Another solution: use .find_next_siblings() with text=True (which finds only NavigableString nodes). Then, on each iteration, check whether the nearest preceding <h4> is still the first one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())
print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']
I'd like to extract image path from text like this:
body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
OR
body = '<br> Hi <img src="/path/to/15004/other_Random_name.png" class="img-responsive" /> other text'
My regexp:
match = re.search(r'src=\"(?P<path1>\"', body)
if match:
    print(match.group('path1'))
else:
    print("no match found")
But it does not capture any path. How can I fix this?
For a quick and dirty hack, you could use
<img[^>]*src="([^"]+)
The golden path would be to use a parser though. See a demo on regex101.com.
In Python this could be
import re
junk = """body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
body = '<br> Hi <img src="/path/to/15004/other_Random_name.png" class="img-responsive" /> other text'"""
rx = re.compile(r'<img[^>]*src="([^"]+)')
sources = rx.findall(junk)
print(sources)
Which yields
['/path/to/1234/some_Random_name24.jpg', '/path/to/15004/other_Random_name.png']
See another demo on ideone.com.
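As for the parser route mentioned above, here is a minimal sketch that uses only the standard library's html.parser. The names ImgSrcCollector and extract_img_sources are made up for illustration, and the sketch assumes you only need the src of each <img> tag.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


body = 'some text here <br> <img src="/path/to/1234/some_Random_name24.jpg" class="img-responsive" /> </br>'
print(extract_img_sources(body))
```

Unlike the regex, this does not care about attribute order or about single versus double quotes around the value.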
I've been trying to combine the text of all the Content elements in an XML file using Python.
I succeeded in combining all the Content text, but I need to exclude any Content that sits right below a <Br /> element.
A <Br /> element represents a line break (Enter) in Adobe InDesign.
This XML was exported from Adobe InDesign.
Here is an example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>BBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>EEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
And this is what I want:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>AAABBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>DDDEEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
As you can see, I don't want to prepend Content text to the next one if there is a <Br /> element right above the Content that I want to add to.
In detail, the first Content element's text is AAA and the next one is BBB.
In this case AAA should be attached in front of BBB.
BBB is not attached in front of CCC because there is a <Br /> element right above the CCC Content.
Could you help me recognize the <Br /> element so I can skip it?
This is my code so far, but it doesn't work well...
import xml.etree.ElementTree as ET  # assumed; the original post does not show the import

tree = ET.parse("C:\\Br_test.xml")
root = tree.getroot()
for ParagraphStyleRange in root.findall('.//Story/ParagraphStyleRange'):
    CharacterStyleRange_count = len(ParagraphStyleRange.findall('CharacterStyleRange'))
    #print(CharacterStyleRange_count)
    if int(CharacterStyleRange_count) >= 2:
        try:
            Content_collect = ''
            for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange'):
                Br_count = len(CharacterStyleRange.findall('Br'))
                print(Br_count)
                if int(Br_count) == 0:
                    for Content in CharacterStyleRange.findall('Content'):
                        Content_collect += Content.text
                        Content.text = str(Content_collect)
                        print(Content_collect)
            #---- Code to delete Contents that are attached to next one---
            #for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange')[:-1]:
            #    for Content in CharacterStyleRange.findall('Content'):
            #        Content_remove = CharacterStyleRange.remove(Content)
        except:
            pass
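One possible approach, sketched under an assumption: the rule is read as "walk the Br and Content elements of a paragraph in document order, keep a running string, and reset it whenever a Br is seen". That reading reproduces the desired output shown above, but whether longer runs should keep accumulating is a guess. The XML is inlined here instead of being read from C:\Br_test.xml.

```python
import xml.etree.ElementTree as ET

xml_text = """<Root><Story><ParagraphStyleRange>
<CharacterStyleRange><Content>AAA</Content></CharacterStyleRange>
<CharacterStyleRange><Content>BBB</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>CCC</Content><Br /><Content>DDD</Content></CharacterStyleRange>
<CharacterStyleRange><Content>EEE</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>FFF</Content><Br /></CharacterStyleRange>
</ParagraphStyleRange></Story></Root>"""

root = ET.fromstring(xml_text)

for paragraph in root.iterfind(".//Story/ParagraphStyleRange"):
    run = ""  # text accumulated since the last <Br />
    # iter() walks the paragraph and all its descendants in document order,
    # so a <Br /> is seen before the <Content> that follows it.
    for node in paragraph.iter():
        if node.tag == "Br":
            run = ""  # a line break starts a fresh run
        elif node.tag == "Content":
            run += node.text or ""
            node.text = run

merged = [content.text for content in root.iter("Content")]
print(merged)
```

The key difference from counting Br elements per CharacterStyleRange is that position matters: a Br only affects the Content that comes after it, so DDD can still be prepended to EEE.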
I am using lxml to parse a web document. I want to get all the text in a <p> element, so I use the following code:
from lxml import etree
page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
print(page.xpath("//p")[0].text)  # this prints just "test1", not "test1 <br/> test2"
The problem is that I want all the text in <p>, which is test1 <br /> test2 in the example, but lxml gives me only test1.
How can I get all the text in a <p> element?
Several other possible ways:
p = page.xpath("//p")[0]
print(etree.tostring(p, method="text", encoding="unicode"))
or use the XPath string() function (note that XPath position indexes start at 1, not 0):
page.xpath("string(//p[1])")
Maybe like this
from lxml import etree
pag = etree.HTML("<html><p>test1 <br /> test2</p></html>")
# get all texts
print(pag.xpath("//p/text()"))
['test1 ', ' test2']
# concatenate
print("".join(pag.xpath("//p/text()")))
test1 test2
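The reason .text stops at test1 is the element model: .text holds only the text before the first child, and whatever follows a child is stored on that child as .tail. Here is a small sketch of this, using the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml. lxml's etree exposes the same .text, .tail and itertext() members.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>test1 <br /> test2</p>")
br = p[0]

# Text before the first child belongs to the parent element.
print(repr(p.text))

# Text after <br /> is attached to the <br /> element as its tail.
print(repr(br.tail))

# itertext() walks text and tail pieces in document order.
full_text = "".join(p.itertext())
print(repr(full_text))
```

itertext() also descends into nested children, which the //p/text() XPath does not.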
This is my HTML:
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
The number of text blocks is variable; Header1 is consistent, Header2 is not.
I'm successfully extracting the first block of text using the following code:
def get_summary(soup):
    raw = soup.find('div', {"class": "left_panel"})
    for h4 in raw.findAllNext('h4'):
        following = h4.nextSibling
        return following
However, I need all of the items sitting between the two h4 tags. I was hoping that using h4.nextSiblings would solve this, but for some reason it raises the following error:
TypeError: 'NoneType' object is not callable
I've been trying variations on this answer: Find next siblings until a certain one using beautifulsoup but the absence of a leading tag is confusing me.
Find the first header and iterate over .next_siblings until you hit another header:
from bs4 import BeautifulSoup
data = """
<div class="left_panel">
<h4>Header1</h4>
block of text that I want.
<br />
<br />
another block of text that I want.
<br />
<br />
still more text that I want.
<br />
<br />
<p> </p>
<h4>Header2</h4>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
header1 = soup.find('h4', text='Header1')
for item in header1.next_siblings:
    if getattr(item, 'name', None) == 'h4' and item.text == 'Header2':
        break
    print(item)
Update (collecting texts between two h4 tags):
texts = []
for item in header1.next_siblings:
    if getattr(item, 'name', None) == 'h4' and item.text == 'Header2':
        break
    try:
        texts.append(item.text)
    except AttributeError:
        texts.append(item)
print(''.join(texts))
I don't understand why you are passing soup as an argument but don't use it.
If you use the correct soup instance, you shouldn't get that error. findAllNext('h4') returns <h4>Header1</h4> and <h4>Header2</h4>; applying nextSibling to each returns its text sibling, which are
block of text that I want.
and
')
in your case.