I've been experimenting with XPath through Python.
The thing is that not all the expressions work.
I have just found the XPath helper chrome extension.
As you see Chrome detects the XPath, but Python doesn't.
The website : link
My code :
import __future__
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
soup = str(BeautifulSoup(page.content, 'html.parser'))
tree = html.fromstring(soup)
smth = tree.xpath('/html/body/table[@class="center"][2]/tbody/tr[1]/td[2]/table[2]/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr/td/text()')
print(smth)
The smth list is empty. Why? It should contain all the tds I pointed to in the XPath.
Somehow it is getting annoying to see the same problem again and again in only slightly different questions.
The problem is (and this will not change) that the HTML on the page is completely broken, so you need to accept that the DOM interpretation differs between the browser, lxml and BeautifulSoup. I suggest saving the soup string to a file and trying to figure out what BeautifulSoup did with the broken HTML.
With that you may be able to figure out what the right XPath (if any) is.
Your xpath is using tbody as part of the selector, when no tbody tags exist in those tables. Your browser is filling in tbody sections when it renders the page because they're a required part of the spec, but if you view the source you'll see they don't actually exist.
Don't trust what the browser shows you, especially if you have JavaScript enabled. You'll often end up with pages where the element tree is nothing like what a simple requests.get() will see.
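For instance, here is a minimal sketch of the same request with the tbody segments dropped and a relative path instead of the browser-generated absolute one; the predicate on table is an unverified guess, not something taken from the page, so adjust it to whatever lxml actually built from the broken markup:

from lxml import html
import requests

page = requests.get('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
tree = html.fromstring(page.content)

# Relative path, no tbody segments; the class predicate is a guess and
# may need adjusting once you've inspected the tree lxml built.
cells = tree.xpath('//table[@class="center"]//td/text()')
print(cells)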
Related
I am trying to make a coronavirus tracker using beautifulsoup just for some practice.
My code is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://sample.com")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find("div", class_="ZDcxi")
print(table)
The output shows None, but the div tag with the class ZDcxi does have content.
Please help.
The data you see in the browser, including the target div, is dynamic content generated by scripts that are included with the page and run in the browser. If you just search for the class name in page.content, you will find it is not there.
What many people do is use selenium to open desired pages through Chrome (or another web browser), and then, after the page finishes loading and generating dynamic content, use BeautifulSoup to harvest the content from the browser, and continue processing from there.
Find out more at Requests vs Selenium Python, and by searching for "selenium vs requests".
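A minimal sketch of that approach, assuming Chrome with a matching chromedriver is available; the URL and the ZDcxi class name are simply taken from the question above:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # or webdriver.Firefox(), whichever driver you have
driver.get("https://sample.com")  # placeholder URL from the question

# page_source holds the DOM after the page's scripts have run; hand it to
# BeautifulSoup exactly as you would hand it page.content from requests.
# Slow-loading pages may additionally need an explicit wait before this point.
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find("div", class_="ZDcxi")
print(table)

driver.quit()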
Selenium can be used to navigate a web site (login, get html source of a page on the site),
but there is nothing in Selenium that will find/get data in that HTML by XPath (find_element_by_xpath() will find elements, but not TEXT data outside of tags, so something else like lxml must be used); Selenium absolutely cannot be used to do this, and when you try, it throws an error.
There are no examples anywhere on the web of using Selenium to get the HTML source, passing that to lxml to parse the HTML, and finding/getting data by XPath.
It is not to be found.
lxml examples are usually given in conjunction with the Python 'requests' library from which the response in bytes (response.content) is obtained.
lxml uses this response.content (bytes), but no lxml function seems to accept the HTML as a string.
Selenium only returns html as a string: self.driver.page_source
So what to do here?
I need to use lxml, because it provides xpath capability.
I cannot use Python's requests library to log in to the web site and navigate to a page; it just does not work with this site because of some complexities in how they designed things.
Selenium is the only thing that will work to log in, create a session, and pass the right cookies on a subsequent GET request.
I need to use Selenium and 'page_source' (a string), but I am not sure how to convert it to the exact 'bytes' that the lxml functions require.
It's proving quite difficult to scrape with Python given the way these libraries do not work together: Selenium has no option to produce the HTML as bytes, and lxml does not seem to accept the data as either string or bytes.
Any and all help would be appreciated, but I don't believe this can be answered unless you have specifically experienced the problem and have successfully used Selenium + lxml together.
Try something along these lines and see if it works for you:
import lxml.html

# page_source is a plain string; lxml.html.fromstring() accepts it directly,
# so there is no need to convert it to bytes first.
data = self.driver.page_source
doc = lxml.html.fromstring(data)
target = doc.xpath('some xpath')
I'm completely green to MechanicalSoup and web scraping.
I have been working on parsing an HTML timetable and turning it into an iCalendar (ics) file to get it on mobile (which I have successfully done, yay).
To make it work so far, I downloaded the HTML of the timetable site once I had selected my timetable; now I need to use Python to actually navigate to the timetable.
Here is my code so far (I am stuck because the HTML is sooo messy I don't know how to do it, and the documentation for MechanicalSoup is not that large yet):
import argparse
import mechanicalsoup
from getpass import getpass
browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=True,
    user_agent='MyBot/0.1: mysite.example.com/bot_info',
)
browser.open("http://keaplan.kea.dk/sws/prodE2017/default.aspx")
browser.select_form(WHAT TO SELECT :D)
See the HTML here: http://keaplan.kea.dk/sws/prodE2017/default.aspx
I want to do the following:
td class="FilterPanel" #go to the table containing this td
div id = pFilter #set value to BYG
div id = pObject #set value to BAKINT-2l
submit (which will redirect to the timetable I need)
and download the HTML from the submitted redirect.
Help is lovingly appreciated!
The argument of select_form is a CSS selector. If you have just one form, then "form" can do the trick (the next version of MechanicalSoup will actually have this as default argument). Otherwise, use your browser's developer tools, for example Firefox has Right-Click -> Inspect Element -> Right Click -> Copy -> CSS selector, that can be a good starting point.
In your case, even though there's a funny layout, there is only one form, so:
browser.select_form("form")
Unfortunately, the page you are pointing to is partly generated with JavaScript (the select element you're searching for doesn't appear in the soup object obtained by parsing the page). See what MechanicalSoup sees from your page with
browser.launch_browser()
:-(. You can work around the issue by creating the missing controls yourself with new_control.
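A rough sketch of that workaround; the control names pFilter and pObject are simply lifted from the question, and whether those are the names and values the server actually expects has not been verified:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(soup_config={'features': 'lxml'})
browser.open("http://keaplan.kea.dk/sws/prodE2017/default.aspx")
form = browser.select_form("form")

# These selects are built by JavaScript, so they are missing from the soup;
# create them by hand with the values you would have picked in the browser.
form.new_control("select", "pFilter", value="BYG")
form.new_control("select", "pObject", value="BAKINT-2l")

response = browser.submit_selected()
print(response.text[:500])  # start of the redirected timetable page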
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a url and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I want to eventually read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with the class 'product-list-item'.
Try using selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
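From there you can hand the html string to BeautifulSoup just like in your original script; a sketch continuing from the snippet above, using the 'product-list-item' class from your edit:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# grab every product div and follow the links inside it
for div in soup.find_all("div", class_="product-list-item"):
    for a in div.find_all("a", href=True):
        print(a["href"])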
Check this link; you can get all the info by changing the URL. The link can be found in Chrome dev tools > Network.
The reason why you got nothing from that specific url is simply because the info you need is not there.
So first let me explain a little about how that page is loaded in a browser: when you request the page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). Then the browser starts to read/parse that content; basically it tells the browser where to find everything it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/urls/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not in the original url, so you need to find out which url is used to populate those areas and go after that specific url instead.
So now we need to find out what happens behind the scenes, and we need a tool to capture all the traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser (and that's only part of the whole page-loading process)! So by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module could do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job; you can get it here.
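A bare-bones sketch of that last step; the api.hm.com URL below is only a placeholder, and you would paste in the exact request URL you captured in Fiddler or the browser's Network tab:

import requests

# Placeholder: substitute the real api.hm.com request you captured.
api_url = "https://api.hm.com/..."

response = requests.get(api_url)
data = response.json()  # the endpoint already returns JSON, so no HTML parsing needed
print(data)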
Try this one:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(),'lxml')
scrapdiv = open('scrapdiv.txt','w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
I am trying to scrape this page on Flipkart:
http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto+x+play&otracker=from-search
I am trying to find the div with class "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco" but it returns empty result.
from bs4 import BeautifulSoup
import requests
url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)
soup = BeautifulSoup(page.text)
divs = soup.find_all("div",{"class":"fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print divs
divs is empty. I copied the class name using inspect element.
I found the answer in this question: http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites
When you use requests.get(url) you load the HTML content of the url without JavaScript enabled. Without JavaScript enabled, the section of the page called 'customers who viewed this product also viewed' is never even rendered.
You can explore this behaviour by turning off JavaScript in your browser. If you scrape regularly, you might also want to download a JavaScript switcher plugin.
An alternative that you might want to look into is using a browser automation tool such as selenium.
requests.get(..) will return the content of the plain HTTP GET on that url. None of the JavaScript resources the page references will be downloaded, and any inline JavaScript will not be executed either.
If Flipkart uses JS to modify the DOM after it is loaded in the browser, those changes will not be reflected in the page.content or page.text values.
You could try a different parser instead of the default parser in Beautiful Soup. I tried html5lib and it worked for a different website; maybe it will for you too. It will be slower than the default parser, but could be faster than Selenium or other full-fledged headless browsers.
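A small sketch of that parser swap (html5lib has to be installed separately, e.g. pip install html5lib); note that if the carousel markup is genuinely absent from the raw response, no parser will bring it back:

from bs4 import BeautifulSoup
import requests

url = "http://www.flipkart.com/moto-x-play/p/itmeajtqp9sfxgsk?pid=MOBEAJTQRH4CCRYM&ref=L%3A7224647610489585789&srno=p_1&query=moto%20x%20play&otracker=from-search"
page = requests.get(url)

# html5lib is the most lenient parser BeautifulSoup supports
soup = BeautifulSoup(page.text, 'html5lib')
divs = soup.find_all("div", {"class": "fk-ui-ccarousel-supercontainer same-vreco-section reco-carousel-border-top sameHorizontalReco"})
print(divs)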