I am new to Scrapy and want to crawl some data from the web. The HTML documents I get look like the two styles below.
dom style1:
<div class="user-info">
<p class="user-name">
something in p tag
</p>
text data I want
</div>
dom style2:
<div class="user-info">
<div>
<p class="user-img">
something in p tag
</p>
something in div tag
</div>
<div>
<p class="user-name">
something in p tag
</p>
text data I want
</div>
</div>
I want to get the text text data I want. Right now I can get it with a CSS or XPath selector by checking which structure exists, but I want to know a better way.
For example, I can select p.user-name first, then get its parent, and then get that parent's div/text(); the data I want is always in the text() of the immediate parent div of p.user-name. The question is: how can I get the immediate parent of p.user-name?
With XPath you can traverse the tree in every direction (parent, sibling, child etc.), which CSS selectors don't support.
For your case you can get a node's parent with the XPath .. parent notation:
//p[@class='user-name']/../text()
Explanation:
//p[@class='user-name'] - find <p> nodes whose class value is user-name.
/.. - select the node's parent.
/text() - select the text nodes of the current node (the parent <div>).
This XPath should work in both of your described cases.
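For instance, here is a standalone check of that expression, assuming the page is parsed with lxml (a Scrapy selector accepts the same XPath):

```python
from lxml.html import fromstring

html = '''<div class="user-info">
<p class="user-name">
something in p tag
</p>
text data I want
</div>'''

root = fromstring(html)
# /.. steps up from the matched <p> to its parent <div>;
# text() then returns the <div>'s own text nodes, not the <p>'s
texts = root.xpath("//p[@class='user-name']/../text()")
data = [t.strip() for t in texts if t.strip()]
print(data)  # ['text data I want']
```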
What about using the following-sibling axis?
>>> s = scrapy.Selector(text='''<div class="user-info">
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>''')
>>> username = s.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n'
>>>
>>> s2 = scrapy.Selector(text='''<div class="user-info">
... <div>
... <p class="user-img">
... something in p tag
... </p>
... something in div tag
... </div>
... <div>
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>
... </div>''')
>>> username = s2.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n '
>>>
With Python Scrapy, I am trying to get contents in a webpage whose nodes look like this:
<div id="title">Title</div>
<ul>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
...
</ul>
I'm a newbie with XPath and couldn't get it to work so far. My last try was something like:
contents = response.xpath('[@id="title"]/following-sibling::ul[1]//li//p.text()')
... but it seems I cannot use /following-sibling after [@id="title"].
Any idea?
Try this XPath
contents = response.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
It selects both "CONTENT TO EXTRACT" text nodes.
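A quick self-contained check of this expression (using lxml here for illustration; the same XPath works inside response.xpath):

```python
from lxml.html import fromstring

html = '''<div>
<div id="title">Title</div>
<ul>
<li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
<li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
</ul>
</div>'''

root = fromstring(html)
# following-sibling::ul[1] is the first <ul> that follows the title div
contents = root.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
print(contents)  # ['CONTENT TO EXTRACT', 'CONTENT TO EXTRACT']
```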
One XPath would be:
response.xpath('//*[@id="title"]/following-sibling::ul[1]//p/text()').getall()
which gets the text of every <p> tag that is a child or grandchild of the <ul> tag nearest the node with id="title".
Try this using a CSS selector:
response.css('#title ::text').extract()
I have a value I need to grab out of a div tag. Within the div there are a <p>, a <span> and an <input>. When I write out the results of find_all for the main <div> I can see everything I want to get. But when I look for all the <span> tags within that main div, the one I need doesn't exist/return in the results.
This is what is actually on the page source
<div class="video-details">
<p>Web ID: <span itemprop="sku">15COLU2BRNRSTVXXXCAC</span></p>
<span id="SkuDisplay">
<p> SKU: 12139884</p>
</span>
<input type="hidden" id="selectedSku" value="660852" autocomplete="off">
</div>
This is what I have right now; it returns everything in spanSKUitems above except for the <p> SKU </p> line:
for spanSKUitems in soup.find_all('div', class_="video-details"):
    for spanSKUitem in spanSKUitems.find_all('span'):
        strspanSKUitem = str(spanSKUitem.get_text())
        if 'SKU:' in strspanSKUitem:
            bidx = strspanSKUitem.index(':') + 1
            lidx = len(strspanSKUitem)
            dets['sku'] = strspanSKUitem[bidx:lidx].lstrip()
This is what is contained in the spanSKUitems:
<div class="video-details">
<p>Web ID: <span itemprop="sku">15COLU2BRNRSTVXXXCAC</span></p>
<span id="SkuDisplay"></span>
<input id="selectedSku" type="hidden" value=""/></div>
What am I missing or doing wrong?
What I need to get is this tag: <p> SKU: 12139884</p>.
The following works based on the additional HTML you provided. The data is in a string inside a span tag with a different id. You can load it with json and then extract:
import json
data = soup.select_one('#skuDescriptivattribute').text
data = json.loads(data)
print(data['descriptive'][0]['partNumber'])
I'm trying to parse the following HTML in Python using Beautiful Soup. I would like to be able to search for text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags so that for a given category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tags:
soup.find_all("div", "_JDu")
Once I have retrieved the tags I can navigate inside them, but I can't find the right code to locate the text inside one tag and return the text of the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text
color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. Only thing left is getting the text of second <span>, which is what the line key_tag.find_all('span')[1].text does.
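Put together as a runnable snippet (using just the Color block from the HTML above):

```python
from bs4 import BeautifulSoup

html = '''<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

def get_txt(soup, key):
    # find the <span> whose text equals key, step up to its parent <div>,
    # then return the text of the second <span> inside that <div>
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text

print(get_txt(soup, 'Color'))  # Slate, mykonos
```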
Give it a go. It can also give you the corresponding values. Make sure to wrap the HTML elements in a content = """ """ variable (between triple quotes) to see how it works.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try other keys to see that it finds the corresponding values
        val = item.find_next("span").text
        print(val)
I'm using XPath to scrape a web page, but I'm having trouble with one part of the code:
<div class="description">
here's the page description
<span> some other text</span>
<span> another tag </span>
</div>
I'm using this code to get the value from the element:
description = tree.xpath('//div[@class="description"]/text()')
I can find the correct div I'm looking for, but I only want the text "here's the page description", not the content of the inner span tags.
Does anyone know how I can get only the text in the root node, and not the content of the child nodes?
The expression you are currently using actually already matches only the top-level text child nodes. You can just wrap it in normalize-space() to clean the text of extra newlines and spaces (note that normalize-space() applied to a node-set takes the string value of the first node, which here is the description text):
>>> from lxml.html import fromstring
>>> data = """
... <div class="description">
... here's the page description
... <span> some other text</span>
... <span> another tag </span>
... </div>
... """
>>> root = fromstring(data)
>>> root.xpath('normalize-space(//div[@class="description"]/text())')
"here's the page description"
To get the complete text of a node including the child nodes, use the .text_content() method:
node = tree.xpath('//div[#class="description"]')[0]
print(node.text_content())
I have an HTML document as follows:
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
I don't need references from the article, I want to slice the document at the second h2 tag.
Obviously I can find a list of h2 tags like so:
soup = BeautifulSoup(html)
soupset = soup.find_all('h2')
soupset[1]  # this would get the h2 heading 'References' but not what comes before it
I don't want to get a list of the h2 tags, I want to slice the document right at the second h2 tag and keep the above contents in a new variable. Basically the desired output I want is:
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
What is the best way to go about doing this "slicing"/cutting of the HTML document, instead of simply finding tags and outputting the tags themselves?
You can remove/extract every sibling element of the "References" element and the element itself:
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
    elm.extract()
references.extract()
print(soup)
Prints:
<div>
<h1> Name of Article</h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
</div>
You can find the location of the last h2 in the original HTML string (here html is the raw document string) and take a substring up to it; drop the + len(last_h2_tag) part if you don't want the "References" heading itself in the result:
last_h2_tag = str(soup.find_all("h2")[-1])
html[:html.rfind(last_h2_tag) + len(last_h2_tag)]
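A self-contained sketch of this string-slicing approach, here dropping the final heading to match the desired output. One caveat: str() must serialize the tag exactly as it appears in the raw HTML for rfind to find it, which holds for simple markup like this but not in general:

```python
from bs4 import BeautifulSoup

html = '''<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>'''

soup = BeautifulSoup(html, 'html.parser')
last_h2_tag = str(soup.find_all("h2")[-1])
# keep everything before the serialized last <h2>; "References" and after is dropped
sliced = html[:html.rfind(last_h2_tag)]
print(sliced)
```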