I'm trying to learn about web scraping and HTML, and I want to be able to do this in a Jupyter notebook for easy execution and exploration. My usual requests-based methods haven't worked, as I believe the website uses JavaScript, so I've had to try a different tack.
The most promising method I've tried so far has been Requests-HTML:
from requests_html import HTMLSession
URL = "https://www.analyticsinsight.net/17th-jan-updates-solana-cardano-avalanche-matic-litecoin-bitgert-centcex/?fbclid=IwAR0ZegSBRgL3VBFR4aJJqrfyi0xuQqPcZb3HovcgXzrK4xBb_Z-98JpgNlM"
session = HTMLSession()
r = session.get(URL)
r.html.render()
However, this doesn't work in a Jupyter notebook and only runs when executed as a .py file. That's fine, but I was wondering how I could save the rendered HTML data and then parse it with BeautifulSoup in a Jupyter notebook?
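What I'm picturing is something like the following, though I'm not sure it's the right way (rendered.html is just a placeholder filename I chose): run the render step as a plain .py script, write the rendered HTML to disk, then load that file in the notebook.

# scrape.py -- run as a plain .py script, since render() only works for me outside Jupyter
from requests_html import HTMLSession

URL = "https://www.analyticsinsight.net/17th-jan-updates-solana-cardano-avalanche-matic-litecoin-bitgert-centcex/?fbclid=IwAR0ZegSBRgL3VBFR4aJJqrfyi0xuQqPcZb3HovcgXzrK4xBb_Z-98JpgNlM"

session = HTMLSession()
r = session.get(URL)
r.html.render()  # execute the page's JavaScript

with open("rendered.html", "w", encoding="utf-8") as f:
    f.write(r.html.html)  # save the rendered HTML to disk

and then in a Jupyter cell:

# parse the saved file with BeautifulSoup
from bs4 import BeautifulSoup

with open("rendered.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "lxml")

print(soup.title)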
What I'm trying to do:
from requests_html import HTMLSession
with HTMLSession() as s:
    r = s.get('url', cookies=my_cookie_jar)
    r.html.render()
    print(r.html.html)
I want to access a page where I need to log in. I have already logged in using a Selenium browser and exported the cookies as a RequestsCookieJar.
Now, when I print the text returned by the GET request, I receive the text of the correct webpage (but without the JavaScript rendered). However, as soon as I render the HTML, the cookies seem to have no effect and I get the HTML of a page asking me to log in (the same I get when issuing the request without the cookies in the first place).
Now my question:
Is it possible to specify the cookies when rendering the html (or should requests-html already do this by default)?
Yes, you can, by using the cookies keyword argument of the render method:
r.html.render(cookies=my_cookie_jar)
Solution, from GitHub (https://github.com/psf/requests-html/issues/109), which seems to work for me:
html.render(reload=False)
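Putting the two suggestions together, a rough sketch of the whole flow might look like this (assuming, as in the question above, that my_cookie_jar is the RequestsCookieJar exported from the Selenium session, and that render accepts the same cookie jar as the answer above suggests):

from requests_html import HTMLSession

with HTMLSession() as s:
    r = s.get('url', cookies=my_cookie_jar)  # request made with the exported cookies
    # pass the cookies to render() as well; reload=False stops render() from
    # re-downloading the page without the session's cookies
    r.html.render(cookies=my_cookie_jar, reload=False)
    print(r.html.html)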
This is the code that I wrote. I have watched a lot of tutorials, and they get the output with exactly the same code:
import requests
from bs4 import BeautifulSoup as bs
url="https://shop.punamflutes.com/pages/5150194068881408"
page=requests.get(url).text
soup=bs(page,'lxml')
#print(soup)
tag=soup.find('div',class_="flex xs12")
print(tag)
I always get None. Also, the class name seems strange, and the view-source code shows different content from what the inspect-element view shows.
Bs4 can be weird. Sometimes it returns different code than what you see on the page; it depends on the source it is given. Try using Selenium. It works great and has many more uses than bs4; most of all, it is super easy to find elements on a site.
It's not a bs4 problem; it is correctly parsing what requests returns. It rather depends on the webpage itself.
If you inspect the "soup", you will see that the source of the page is a set of links to scripts that render the content on the page. For these scripts to be executed, you need a browser: requests will only get you what the web server returns, but it won't execute the JavaScript for you. You can verify this yourself by deactivating JavaScript in your browser's developer tools.
The solution is to use a web browser (e.g. headless chrome + chromedriver) and Selenium to control it. There are plenty of good tutorials out there on how to do this.
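For example, a rough sketch along those lines (the headless-Chrome options and the hand-off to BeautifulSoup are my own choices, and it assumes chromedriver is installed and on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://shop.punamflutes.com/pages/5150194068881408")
# by now the page's scripts have run, so the rendered DOM contains the content
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find("div", class_="flex xs12"))

driver.quit()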
I'm practicing parsing web pages with Python. What I do is
ans = requests.get(link)
Then I use re to extract some information from the HTML, which is stored in
ans.content
What I've found is that some sites use scripts that are executed automatically in a browser, but not when I try to download the page using requests. For example, instead of getting a page with the information, I get something like
scripts_to_get_info.run()
in the HTML code.
A browser is installed on my computer, and so is the program I wrote, so theoretically there should be a way to run these scripts and get the information from my Python code before parsing.
Is it possible? Any suggestions?
(The idea that this is doable comes from the fact that when I inspect the page in the browser, I see the real HTML without any of those scripts.)
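What I'm imagining is something like the sketch below; I'm assuming a library such as requests-html (mentioned earlier in this thread), whose render() call runs the page's scripts before I start parsing. link here is the same variable as in my snippet above.

from requests_html import HTMLSession

session = HTMLSession()
ans = session.get(link)               # link is the page URL, as in the snippet above
ans.html.render()                     # downloads Chromium on first use and runs the page's scripts
html_after_scripts = ans.html.html    # the DOM after the scripts have executed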
My goal for this Python code is to create a way to collect job information into a folder. The first step is already failing: when running the code, I want the URL to print as https://www.indeed.com/, but instead the code returns https://secure.indeed.com/account/login. I am open to using urllib or cookielib to resolve this ongoing issue.
import requests
import urllib

data = {
    'action': 'Login',
    '__email': 'email#gmail.com',
    '__password': 'password',
    'remember': '1',
    'hl': 'en',
    'continue': '/account/view?hl=en',
}

response = requests.get('https://secure.indeed.com/account/login', data=data)
print(response.url)
If you're trying to scrape information from Indeed, you should use the Selenium library for Python:
https://pypi.python.org/pypi/selenium
You can then write your program within the context of a real user browsing the site normally.
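A rough sketch of that approach (the selectors and credentials below are placeholders, not Indeed's actual form fields, and it assumes chromedriver is on your PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://secure.indeed.com/account/login")

# placeholder selectors -- inspect the real login form in your browser to find the right ones
driver.find_element(By.NAME, "__email").send_keys("your-email@example.com")
driver.find_element(By.NAME, "__password").send_keys("your-password")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# once logged in, browse as a normal user and read the rendered pages
driver.get("https://www.indeed.com/")
print(driver.current_url)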
I'm building a web crawler which needs to read links inside a webpage, and I'm using Python's urllib2 library to open and read the websites.
I found a website where I'm unable to fetch any data.
The URL is "http://www.biography.com/people/michael-jordan-9358066"
My code:
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
print response.read()
The content I get from the above code is very different from what I see if I open the site in a browser; the response from the code does not include any of the data.
I thought it could be because of a delay in loading the web page, so I introduced a delay. Even after the delay, the response is the same.
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
time.sleep(20)
print response.read()
The web page opens perfectly fine in a browser.
However, the above code works fine for reading Wikipedia or some other websites.
I'm unable to find the reason behind this odd behaviour. Please help, thanks in advance.
What you are experiencing is most likely the effect of dynamic web pages. These pages do not have static content for urllib or requests to fetch; the data is loaded by scripts running on the page. You can use Python's Selenium to solve this.
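Since the goal is a crawler that reads links, a minimal sketch of the Selenium route might look like this (assuming chromedriver is installed and on your PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.biography.com/people/michael-jordan-9358066")

# the page's scripts have run by now, so the links are actually present in the DOM
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
print(links[:20])

driver.quit()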