I am trying to parse the webpage below to get the names of stocks hitting a new all-time high or low on the exchange.
https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#
However, when I download the webpage with Beautiful Soup and check the data, I do not find the stock names or prices anywhere in it.
I want to write a function that downloads the stocks hitting a new all-time high each day. What am I missing?
Part of the HTML on the page is generated dynamically by JavaScript. You are most likely using the requests library, which only fetches the raw HTML and does not execute that JavaScript.
What you can do, instead, is use the Selenium library, which allows you to launch an instance of a web browser controlled by Python, and get the page source from there.
from selenium import webdriver

path = '...'  # path to the driver here
url = 'https://www.bseindia.com/markets/equity/EQReports/HighLow.html?Flag=H#'

driver = webdriver.Chrome(path)
driver.get(url)  # get() returns None, so read page_source from the driver afterwards
page_source = driver.page_source
By parsing page_source with BeautifulSoup, you can get what you want.
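For instance, here is a minimal sketch continuing from the snippet above; the table markup is an assumption, so inspect the rendered page in your browser's developer tools for the real id or class of the highs table:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')

# Hypothetical: walk every table row and print its cells; the real page may need
# a more specific selector (and possibly a short wait before reading page_source).
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)  # should include the scrip name and price columns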
I am trying to scrape this mobile link https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg with a simple requests call. The page normally only opens in the Tokopedia mobile app.
It should return the price and product name, but I cannot find either in the content of the response. Do I have to use Selenium and wait for the page to load? Please help.
Currently the code is just:
resp = requests.get("https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg", headers = {'User-Agent':'Mozilla/5.0'})
I tried searching the response text for the price with Python's in operator, but it is not there. What should I do?
The reason you are unable to get all the data you are expecting is that this website builds its content with JavaScript. What this means for you is that you need a scraping tool capable of rendering JavaScript.
What you are doing right now is fetching the raw HTML exactly as your browser first receives it, but nothing ever executes the scripts on the page, which is why your data is incomplete.
For starters, I would recommend using Selenium for the job. It'll look something like this:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg')
print(driver.page_source)
To get started with Selenium and its installation, I recommend this resource
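Since you asked about waiting for the page to load: an explicit wait before reading page_source often helps with pages like this. A minimal sketch, assuming the product name ends up in an h1 tag (check the real selector in the developer tools):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg')

# Wait up to 10 seconds for the (assumed) product-name element to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.h1.get_text(strip=True))  # product name, if it really is in an <h1>
driver.quit()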
I am trying to extract book names from the O'Reilly Media website using Python and Beautiful Soup.
However, I see that the book names are not in the page-source HTML.
I am using this link to see the books:
https://www.oreilly.com/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true
Attached is a screenshot showing the webpage with the first two books, alongside the Chrome developer tools with arrows pointing to the elements I'd like to extract.
I looked at the page source but could not find the book names - maybe they are hidden inside some other links inside the main html.
I tried to open some of the links inside the html and searched for the book names but could not find anything.
Is it possible to extract the first or second book name from the website using Beautiful Soup?
If not, is there any other Python package that can do that? Maybe Selenium?
Or, as a last resort, any other tool...
If you look into the network tab while the page loads, you can see that it sends a request to an API, which returns the books as JSON.
After some investigation, you can get your titles via:
import json
import requests
response_json = json.loads(requests.get(
"https://www.oreilly.com/api/v2/search/?query=*&extended_publisher_data=true&highlight=true&include_assessments=false&include_case_studies=true&include_courses=true&include_playlists=true&include_collections=true&include_notebooks=true&include_sandboxes=true&include_scenarios=true&is_academic_institution_account=false&source=user&formats=book&formats=article&formats=journal&sort=date_added&facet_json=true&json_facets=true&page=0&include_facets=true&include_practice_exams=true&orm-service=search-frontend").text)
for book in response_json['results']:
    print(book['highlights']['title'][0])
To solve this issue, you need to know that Beautiful Soup can only deal with plain HTML. For websites that build their pages with JavaScript, Beautiful Soup can't see all of the page data you are looking for, because you need something like a browser to execute the JavaScript and load that data.
That is where Selenium comes in: it opens a browser page and loads all of the page's data, and you can combine the two like this:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This will make Selenium run in the background (headless)
chrome_options = Options()
chrome_options.add_argument("--headless")

# You need to install the driver and point Selenium at it
driver = webdriver.Chrome('#Dir of the driver', options=chrome_options)
driver.get('#url')

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
With this you can get all the data you need. Don't forget to call the following at the end to quit the Selenium instance running in the background:
driver.quit()
I have a flash-card-making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed, however, that the examples on that page are dynamically generated, so I installed Beautiful Soup and the html5lib parser. The tag I'm specifically interested in is:
<span class="megaexamples-pair-part">Los perros siguieron el rastro del
  <span class="megaexamples-highlight">zorro</span>.
</span>
The code I'm using to try and retrieve it is:
from urllib2 import urlopen  # on Python 3: from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})
However, no matter which way I swing it, I can't seem to get it to pull down the dynamically generated code. I have confirmed I'm getting the page by searching for megaexamples-container, which is found fine (and you can see it by right-clicking in Google Chrome and hitting View Page Source).
Any ideas?
What you're doing is just pulling the HTML page; the rest of the content is likely loaded from the server via JavaScript calls.
You have 2 options:
Use a webdriver such as Selenium to control a web browser that correctly loads the entire page (you can then parse it with BeautifulSoup or find elements with Selenium's own tools). This incurs some overhead because a real browser is involved.
Use the network tab of your browser's developer tools (usually opened with F12) to analyze the requests the page makes while loading dynamically, and replicate them with the requests module (see the sketch after this list). This is more efficient but can also be trickier.
Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.
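For option 2, the pattern is to copy the request you see in the network tab and replay it with requests. A rough sketch; the endpoint below is only a placeholder, not the real spanishdict API, so replace it with whatever URL the network tab actually shows (plus any headers or parameters it needs):

import requests

# Placeholder endpoint: substitute the URL observed in the network tab.
api_url = 'https://example.com/api/examples?q=zorro'

resp = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()
data = resp.json()  # the response layout depends entirely on the real API
print(data)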
I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:
Download selenium with pip install selenium
Download the driver for the browser you want to emulate. You can download them from this page. The driver must be in the PATH variable or you will need to specify the path in the constructor for the webdriver.
Import selenium with from selenium import webdriver
Now use the following code:
browser = webdriver.Chrome()
browser.get(raw_input("Enter URL: "))  # raw_input is Python 2; use input() on Python 3
html_source = browser.page_source
Note: If you did not put your driver in path, you have to call the constructor with browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
Note 2: You can use something like webdriver.Firefox() if you want a different browser.
Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')
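Putting the steps together, a minimal end-to-end sketch looks like this (it assumes chromedriver is on your PATH and that the class name from the question is still what the rendered page uses):

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()  # or webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
browser.get('http://www.spanishdict.com/examples/zorro')
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html5lib')
for example in soup.find_all('span', {'class': 'megaexamples-pair-part'}):
    print(example.get_text(strip=True))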
When this page is scraped with urllib2:
url = 'https://www.geckoboard.com/careers/'
response = urllib2.urlopen(url)
content = response.read()
the element in question (the link to the FRONT-END ENGINEER job) is nowhere to be found in the source (content).
Taking a look at the full source that gets rendered in a browser, however, the element is there.
So it would appear that the FRONT-END ENGINEER element is dynamically loaded by JavaScript. Is it possible to have this JavaScript executed by urllib2 (or another low-level library) without involving e.g. Selenium, BeautifulSoup, or other tools?
The pieces of information are loaded via an AJAX request. You can use the Firebug extension for Mozilla, or Google Chrome's own developer tools, to see the details: just hit F12 in Google Chrome while opening the URL and you will find the complete details there.
There you will find a request to https://app.recruiterbox.com/widget/13587/openings/, and the information from that URL is what gets rendered into the web page.
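So, without executing any JavaScript, you can hit that endpoint directly and parse the JSON yourself. A minimal sketch with urllib2; the exact layout of the response is an assumption, so print it first and adapt:

import json
import urllib2  # on Python 3: use urllib.request instead

url = 'https://app.recruiterbox.com/widget/13587/openings/'
openings = json.loads(urllib2.urlopen(url).read())

# Inspect the structure before relying on specific keys.
print(json.dumps(openings, indent=2))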
From what I understand, you are building something generic for multiple websites and don't want to dig into how each particular site is loaded or what requests are made under the hood to construct the page. In this case, a real browser is your friend: load the page in a real browser automated via Selenium, then, once the page is loaded, pass the .page_source to lxml.html (from what I see this is your HTML parser of choice) for further parsing.
If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.
Here is a sample code to get you started:
from lxml.html import fromstring
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")
# TODO: you might need a delay here
tree = fromstring(driver.page_source)
driver.close()
# TODO: parse HTML
You should also know that there are plenty of methods to locate elements in Selenium, and you might not even need a separate HTML parser here.
I think you're looking for something like this: https://github.com/scrapinghub/splash
I have a webpage:
http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#
and I need to extract the table from this webpage.
Problem encountered: I have been using BeautifulSoup and requests to get the URL content. The problem with these methods is that I get the web content before the table has been generated, so I get an empty table:
<table>
  <thead>
  </thead>
  <tbody>
  </tbody>
</table>
My approach: now I am trying to open the URL in the browser using webbrowser.open_new_tab(url) and then get the content from the browser directly. This would give the server time to populate the table, and then I would be able to get the content from the page.
Problem: I am not sure how to fetch information from the web browser directly. Right now I am using Mozilla on a Windows system. The closest link I found only reports which sites are open, not their content.
Is there any other way to let the table load with urllib2 or BeautifulSoup and requests? Or is there any way to get the loaded content directly from the webpage?
Thanks
To add to Santiclause's answer, if you want to scrape JavaScript-populated data you need something that can execute that JavaScript.
For that you can use the selenium package together with a webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts, and get the data.
An example for your case:
from selenium import webdriver
driver = webdriver.Firefox() # You can replace this with other web drivers
driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
source = driver.page_source # Here is your populated data.
driver.quit() # don't forget to quit the driver!
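From there, a minimal sketch for pulling the rows out of source with BeautifulSoup; it assumes the first table on the page is the one you want, so tighten the selector if the markup has several:

from bs4 import BeautifulSoup

soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table')  # assumption: the first table is the relevant one
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    if cells:
        print(cells)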
Of course, if you can access the JSON directly, as user Santiclause mentioned, you should do that. You can find it by checking the network tab of the browser's developer tools while the page loads, which takes some playing around.
The reason the table isn't being filled is that Python doesn't process the page it receives with urllib2, so there is no DOM, no JavaScript that runs, et cetera.
After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.
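A minimal sketch for pulling that JSON directly with requests; the structure of the payload is an assumption, so print it and adapt the field access to what the endpoint really returns:

import requests

resp = requests.get('http://kff.org/datacenter.json', params={'post_id': 32781})
resp.raise_for_status()
data = resp.json()

# Explore the payload before hard-coding any keys.
print(data)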