I'm trying to scrape the title of the following html code:
<FONT COLOR=#5FA505><B>Claim:</B></FONT> Coed makes unintentionally risqué remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>
I'm using this code:
def parse_article(self, response):
    for href in response.xpath('//font[b = "Claim:"]/following-sibling::text()'):
        print href.extract()
and I successfully pull the Claim: value I want from the HTML above, but it also pulls the HTML below (among others with a similar structure on the same page). I'm defining my xpath() to pull in only the font tag labelled Claim:, so why is it pulling in the Origins text as well, and how can I fix it? I tried to get only the next following-sibling instead of all of them, but that didn't work.
<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends
I think your XPath is missing a text() qualifier (explained here). It should be:
'//font[b/text() = "Claim:"]/following-sibling::text()'
The following-sibling axis returns all siblings following an element. If you only want the first sibling, try the XPath expression:
//font[b = "Claim:"]/following-sibling::text()[1]
Or, depending on your exact use case:
(//font[b = "Claim:"]/following-sibling::text())[1]
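The difference between the two forms can be sketched with lxml, using a made-up miniature of the page (tag contents here are stand-ins, not the real article):

```python
from lxml import html

# Both text nodes sit at the same level as the "Claim:" font, which is
# why the unrestricted following-sibling axis returns both of them.
doc = html.fromstring(
    '<div>'
    '<font><b>Claim:</b></font> the claim text'
    '<br/>'
    '<font><b>Origins:</b></font> the origins text'
    '</div>'
)

# All following text siblings -- includes the Origins text too
all_sibs = doc.xpath('//font[b = "Claim:"]/following-sibling::text()')

# The first following text sibling of each matched font element
first_only = doc.xpath('//font[b = "Claim:"]/following-sibling::text()[1]')

# The first node of the entire result set
global_first = doc.xpath('(//font[b = "Claim:"]/following-sibling::text())[1]')
```

With a single matching font the two restricted forms coincide; they differ when several font elements match, since `[1]` inside the step is evaluated once per matched element.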
I've been trying to scoop a portion of text out of some html elements using XPath, but it seems I'm going wrong somewhere, which is why I can't make it work.
Html elements:
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
I would like to dig out International using xpath. I know I could succeed with .next_sibling if I wanted to extract it using a css selector, but I'm not interested in going that route.
That said, if I try like this, I can get it using xpath:
tree.xpath("//*[@class='content']/p/following::text()")[0]
But the above expression is not what I'm after, because I can't use it within selenium webdriver if I stick to driver.find_element_by_xpath().
The only kind of expression I'm interested in is one like the following, but it is not working:
"//*[@class='content']/p/following::*"
Real-life example:
from lxml.html import fromstring
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)
item = tree.xpath("//*[@class='content']/p/following::text()")[0].strip()
elem = tree.xpath("//*[@class='content']/p/following::*")[0].text
print(elem)
In the above example, I can successfully print item but not elem. However, I would like to modify the expression used within elem.
How can I make it work so that I can use the same xpath within the lxml library and within selenium?
Since the OP was looking for a solution which extracts the text sitting outside the tags, the following should do that, albeit in a somewhat awkward manner:
tree.xpath("//*[@class='content']")[0][0].tail
Output:
International
The need for this approach is a result of the way lxml parses the html code:
tree.xpath("//*[@class='content']") results in a list of length 1.
The first (and only) element in the list - tree.xpath("//*[@class='content']")[0] - is an lxml.html.HtmlElement which itself can be treated as a list and also has length 1.
The desired output hides in the tail of the first (and only) child of that lxml.html.HtmlElement...
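A minimal sketch of lxml's text/tail model, using the snippet from the question: text inside a tag is that element's .text, while text between its closing tag and the next sibling belongs to the element's .tail.

```python
from lxml.html import fromstring

htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)

# tree.xpath(...)[0] is the div; its first child [0] is the <p> element
p = tree.xpath("//*[@class='content']")[0][0]

print(p.text)  # the text inside <p>
print(p.tail)  # the text between </p> and </div>, where "International" lives
```

This is also why `following::*` fails here: "International" is not an element at all, just character data attached to the `<p>` as its tail.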
I'm new to scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you can see, the email and phone number are provided as text inside the <p> tag, and that makes them hard to extract.
My idea is to first get text inside the Job Overview or at least all the text talking about this respective job and use ReGex to get the email, phone number and if possible the name of the person.
So, I fired up the scrapy shell using the command: scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and get the response from there.
Now, I try to get all the text from the div job_description, where I actually get nothing. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned ? Obviously, the task will come afterwards to get the attributes mentioned before, but, first things first.
Update: This selection only returns []: response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div-tag actually does not have any text besides what you get.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you can see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div tag, not the text inside the tags nested within it. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
What this does is select all the child nodes of div[@class="job_description"] and return their text (see here for what the different xpaths do).
You will see that this returns a lot of useless text as well, since you are still getting all the \n characters and such. I therefore suggest narrowing your xpath down to the element you want, instead of taking such a broad approach.
For example the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
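The difference between /text() and //*/text() can be sketched with lxml (the markup below is a made-up stand-in for the real page, and the email address and regex are purely illustrative):

```python
import re
from lxml import html

doc = html.fromstring(
    '<div class="job_description">\n\t\t\t'
    '<p>Please send applications to jobs@example.com</p>'
    '</div>'
)

# Direct text children of the div: only the whitespace between the tags
direct = doc.xpath('//div[@class="job_description"]/text()')

# Text children of the elements nested inside the div
nested = doc.xpath('//div[@class="job_description"]//*/text()')

# A rough email pattern (hypothetical, for illustration only)
emails = re.findall(r'[\w.+-]+@[\w.-]+', ' '.join(nested))
```

This mirrors the question: the div's own text is just indentation, so /text() yields whitespace, while //*/text() reaches the text inside the nested <p> that the regex can then search.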
I'm working with a scrapy spider and it's pulling the wrong output for the price.
HTML:
<span style="" class="b-product_price-standard b-product_price-standard--line_through">$350</span>
Xpath:
['price'] = sel.xpath('normalize-space(div/main/div[4]/div[3]/div/div[1]/h1[2]/div/span[1]/text())').extract()
result:
'price': [u'\u20ac300']
It seems the "$" in the price is causing the issue. I've been digging and I can't seem to find an answer to what I thought would be a common issue, which has me thinking there might be more to it that I'm missing.
Any help is greatly appreciated!
Use re() instead of extract():
['price'] = sel.xpath('.../span[1]/text()').re(r'\d+')
Casimir et Hippolyte is right: the correct result is retrieved, but its representation in Python looks different. Besides that, though, your XPath expression is not ideal.
Try not to rely on long-winded positional XPath expressions, they break very easily when there are small changes to the HTML document.
Instead, try to find elements by their attributes. Perhaps this combination of class attributes is unique? For instance
//span[#class = 'b-product_price-standard b-product_price-standard--line_through']
could work. If it does not, you have to show more of the HTML document you are selecting from.
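As a sketch with lxml (the same expressions work in a Scrapy selector), assuming the span shown in the question:

```python
from lxml import html

doc = html.fromstring(
    '<div><span class="b-product_price-standard '
    'b-product_price-standard--line_through">$350</span></div>'
)

# Exact match: @class must equal the full attribute string, spacing,
# order of the classes and all
exact = doc.xpath("//span[@class = 'b-product_price-standard "
                  "b-product_price-standard--line_through']/text()")

# Looser: any span whose class attribute contains the modifier class
loose = doc.xpath(
    "//span[contains(@class, 'b-product_price-standard--line_through')]/text()")
```

The contains() variant is more robust if the site reorders or adds classes, at the cost of also matching any longer class name that happens to contain the substring.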
I'm trying to scrape the title of the following html code:
<FONT COLOR=#5FA505><B>Claim:</B></FONT> Coed makes unintentionally risqué remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>
I've tried using:
def parse_article(self, response):
    for href in response.xpath('//font[@color="#5FA505"]'):
but the title (Coed makes unintentionally...) isn't actually wrapped in any tags, so I haven't been able to get at that content. Is there a way I can get the content when it isn't wrapped in <p> or any other sort of tag?
EDIT: //font[b = "Claim:"]/following-sibling::text() works but it also grabs and displays this bottom piece of html.
<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends
Assuming you know beforehand that the Claim: text is there, locate the font tag by the text of its b child and get the following text sibling:
//font[b = 'Claim:']/following-sibling::text()
Demo from the Scrapy Shell:
In [1]: "".join(map(unicode.strip, response.xpath("//font[b = 'Claim:']/following-sibling::text()").extract()))
Out[1]: u'Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies."'
Note that these join and strip calls should be ideally replaced by the appropriate input or output processors used inside Item Loaders.
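Outside the Python 2 shell, the same normalization in plain Python 3 (where unicode.strip becomes str.strip) might look like the sketch below; the markup is a trimmed, made-up stand-in for the article:

```python
from lxml import html

doc = html.fromstring(
    '<div><font color="#5FA505"><b>Claim:</b></font>   '
    'Coed makes a remark about "little quizzies."\n'
    '<br/><br/></div>'
)

# Grab every text node following the Claim: font, then strip and join,
# mirroring the shell one-liner above
parts = doc.xpath("//font[b = 'Claim:']/following-sibling::text()")
claim = "".join(map(str.strip, parts))
```

The strip/join pass removes the stray whitespace and newlines that the raw text nodes carry.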
I just started using BeautifulSoup and I am encountering a problem. I set up an html snippet below and make a BeautifulSoup object:
html_snippet = '<p class="course"><span class="text84">Ae 100. Research in Aerospace. </span><span class="text85">Units to be arranged in accordance with work accomplished. </span><span class="text83">Open to suitably qualified undergraduates and first-year graduate students under the direction of the staff. Credit is based on the satisfactory completion of a substantive research report, which must be approved by the Ae 100 adviser and by the option representative. </span> </p>'
subject = BeautifulSoup(html_snippet)
I have tried several find and find_all operations, like those below, but all I am getting is None or an empty list:
subject.find(text = 'A')
subject.find(text = 'Research')
subject.next_element.find('A')
subject.find_all(text = 'A')
When I created the BeautifulSoup object from an html file on my computer before, the find and find_all operations all worked fine. However, when I pulled the html_snippet from reading a webpage online through urllib2, I started getting these problems.
Can anyone point out where the issue is?
Pass the argument like this:
import re
subject.find(text=re.compile('A'))
The default behavior for the text filter is to match on the entire body. Passing in a regular expression lets you match on fragments.
EDIT: To match only bodies beginning with A, you can use the following:
subject.find(text=re.compile('^A'))
To match only bodies containing words that begin with A, you can use:
subject.find_all(text = re.compile(r'\bA'))
It's difficult to tell more specifically what you're looking for, let me know if I've misinterpreted what you're asking.