I am trying to create a "universal" XPath, so that when I run my spider it will be able to download the hotel name for each hotel in the list.
This is the XPath that I need to convert:
//*[@id="offerPage"]/div[3]/div[1]/div[1]/div/div/div/div/div[2]/div/div[1]/h3/a
Can anyone point me in the right direction?
This is an example of how they did it in the Scrapy docs:
https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
For the text they have:
'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
When you open "http://quotes.toscrape.com/" and copy the XPath for the text element, you get:
/html/body/div/div[2]/div[1]/div[1]/span[1]
When you look at the HTML you are scraping, just using "Copy XPath" from the browser's source viewer is not enough.
You need to look at the attributes that the html tags have.
Of course, using just tag types as an xpath can work, but what if not every page you are going to scrape follows that pattern?
The Scrapy example you are using uses the span's class attribute to precisely point to the target tag.
I suggest reading a bit more about XPath to understand how flexible your search patterns can be.
If you want to go even broader, reading about the DOM structure will also be useful. Let us know if you need more pointers.
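For example, here is a sketch of a parse method that leans on the page structure instead of the full copied path. The only pieces taken from your XPath are the offerPage id and the trailing h3/a; whether that h3/a pattern really holds for every hotel on the page is an assumption you would have to verify:

def parse(self, response):
    # Anchor on the container id, then let // skip the brittle chain of divs
    for link in response.xpath('//*[@id="offerPage"]//h3/a'):
        yield {'hotel_name': link.xpath('./text()').extract_first()}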
I've been researching this for two days now. There seems to be no simple way of doing this. I can find an element on a page by downloading the html with Selenium and passing it to BeautifulSoup, followed by a search via classes and strings. I want to click on this element after finding it, so I want to pass its Xpath to Selenium. I have no minimal working example, only pseudo code for what I'm hoping to do.
Why is there no function/library that lets me search through the HTML of a webpage, find an element, and then request its XPath? I can do this manually by inspecting the webpage and clicking 'Copy XPath'. I can't find any solutions to this on Stack Overflow, so please don't tell me I haven't looked hard enough.
Pseudo-Code:
# parser is a BeautifulSoup HTML object
for box in parser.find_all('span', class_="icon-type-2"):  # find all elements with a particular icon
    xpath = box.get_xpath()  # hypothetical method: this is what I want to exist
I'm willing to change my code entirely, as long as I can locate a particular element and extract its XPath. So any other ideas on entirely different libraries are welcome.
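For what it's worth, one workaround is to build the XPath yourself by walking up the element's parents. The helper below is an untested sketch, not a built-in BeautifulSoup feature (there is no get_xpath() method); it produces an absolute, index-based path of the same kind the browser's 'Copy XPath' gives, which Selenium's find_element can then consume:

from bs4 import BeautifulSoup

def xpath_for(tag):
    # Walk from the tag up to the document root, recording each step as name[index]
    parts = []
    while tag.parent is not None:  # the BeautifulSoup root object has no parent
        siblings = tag.parent.find_all(tag.name, recursive=False)
        # Compare by identity, not equality, so identical-looking siblings get the right index
        index = next(i for i, s in enumerate(siblings, 1) if s is tag)
        parts.append(tag.name if len(siblings) == 1 else "%s[%d]" % (tag.name, index))
        tag = tag.parent
    return "/" + "/".join(reversed(parts))

parser = BeautifulSoup(page_source, "html.parser")  # page_source: HTML fetched via Selenium
for box in parser.find_all('span', class_="icon-type-2"):
    print(xpath_for(box))  # e.g. /html/body/div[2]/span[1]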
I'm making a twitterbot for an honors project and have it almost completed. However, when I scrape the website for a specific URL, the href refers to a link that looks like this:
?1dmy&urile=wcm%3apath%3a%2Fohio%2Bcontent%2Benglish%2Fcovid-19%2Fresources%2Fnews-releases-news-you-can-use%2Fnew-restartohio-opening-dates
When inspecting the HTML and hovering over the href contents above, it shows that the above is actually the tail end of the link. Is there any way to take this data and make it into a usable link? Other links within the same carousel provide full links on the same website, so I'm not sure why this one is different from the others.
I tried searching for answers to this question but came up short: sorry if this is a repeat.
BeautifulSoup is showing you what the HTML of the page contains. If the link is relative, you need the base URL for the page; that comes back with your request result, not in the HTML itself.
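A rough illustration of stitching the two together with urljoin; the page URL below is a placeholder for whichever page you actually scraped:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://example.gov/covid-19/news"   # placeholder for the real page
resp = requests.get(page_url)
soup = BeautifulSoup(resp.text, "html.parser")

for a in soup.find_all("a", href=True):
    # resp.url is the final URL after any redirects, which is the correct base
    print(urljoin(resp.url, a["href"]))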
I came across an HTML page which contains a <message-banner> tag.
This message-banner defines a kind of overlay over the actual page with a message in it. I can neither find a definition for that tag, nor am I able to select an existing element that is part of this message-banner.
Any ideas how I can handle this issue? I cannot provide a more elaborate example, as this is from a non-public webpage.
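In case it helps, a custom element like <message-banner> has no special meaning to a parser, so it can usually be selected by its tag name like any other element. An untested sketch with lxml, with made-up content since the real page is not public:

from lxml import html

tree = html.fromstring(
    "<body><message-banner><p>Site notice</p></message-banner><div id='main'>...</div></body>"
)
banner = tree.xpath("//message-banner")     # the overlay wrapper itself
inside = tree.xpath("//message-banner//*")  # everything nested inside it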
I am scraping individual listing pages from justproperty.com (the individual listing from the original question is no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This has no results in scrapy, despite working in my browser.
The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get between the div and the td, not laying out the explicit path. I'd have to take a closer look at the document to grok why, but the path given in that area was incorrect.
Don't create XPath expressions by looking at Firebug or Chrome Dev Tools; they change the markup. Remove the /tbody axis step and you'll get exactly what you're looking for.
normalize-space(.//div[@class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
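If it helps, a rough way to check the tbody-free expression against the raw markup (outside the browser, so nothing rewrites it) is to feed the downloaded HTML straight into a Scrapy Selector; page_html here stands in for whatever you fetched:

from scrapy.selector import Selector

sel = Selector(text=page_html)  # page_html: the raw listing HTML, as downloaded
ref = sel.xpath(
    'normalize-space(.//div[@class="info_div"]/table/tr/td['
    'normalize-space(text())="Ref:"]/following-sibling::td[1]/text())'
).extract()[0]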
Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.
I inherited someone else's (dreadful) codebase and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.
I'm currently using ElementTree in Python, trying to parse the site using xpath. Unfortunately, it seems that the html is malformed, and ElementTree keeps throwing errors.
Are there more error friendly xpath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other methods, such as preprocessing, that can be used to help this process?
LXML can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:
>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
My commiserations!
You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:
You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.
and more importantly:
Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it "Find all the links", or
"Find all the links of class externalLink", or "Find all the links
whose urls match "foo.com", or "Find the table heading that's got bold
text, then give me that text."
BeautifulSoup copes very well with malformed HTML. You should also definitely look at How do I fix wrongly nested / unclosed HTML tags?, where Tidy was also suggested.
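As a rough sketch of the dead-link sweep the question describes (not taken from the answer), using BeautifulSoup plus requests; the homepage URL and timeout are placeholders:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://example.com/"   # placeholder for the homepage
soup = BeautifulSoup(requests.get(base, timeout=10).text, "html.parser")

dead = []
for a in soup.find_all("a", href=True):
    url = urljoin(base, a["href"])
    try:
        # HEAD keeps the sweep cheap; some servers need GET instead
        if requests.head(url, allow_redirects=True, timeout=10).status_code >= 400:
            dead.append(url)
    except requests.RequestException:
        dead.append(url)

print(dead)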
This is a bit OT, but since it's the links you are interested in, you could also use an external link checker.
I've used Xenu Link Sleuth for years and it works great. I have a couple of sites that have more than 15,000 internal pages and running Xenu on the LAN with 30 simultaneous threads it takes about 5-8 minutes to check the site. All link types (pages, images, CSS, JS, etc.) are checked and there is a simple-but-useful exclusion mechanism. It runs on XP/7 with whatever authorization MSIE has, so you can check member/non-member views of your site.
Note: Do not run it when logged into an account that has admin privileges or it will dutifully wander backstage and start hitting delete on all your data! (Yes, I did that once -- fortunately I had a backup. :-)