Unable to read HTML content - python

I'm building a web crawler that needs to read the links inside a web page. For this I'm using Python's urllib2 library to open and read websites.
I found a website where I'm unable to fetch any data.
The URL is "http://www.biography.com/people/michael-jordan-9358066"
My code,
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
print response.read()
The content I get by running the above code is very different from what I see when I open the page in a browser. The output of the code does not include any of the page's data.
I thought it could be because of a delay in reading the web page, so I introduced a delay. Even after the delay, the response is the same.
import time
import urllib2
response = urllib2.urlopen("http://www.biography.com/people/michael-jordan-9358066")
time.sleep(20)
print response.read()
The web page opens perfectly fine in a browser.
However, the above code works fine for reading Wikipedia or some other websites.
I'm unable to find the reason behind this odd behaviour. Please help, thanks in advance.

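What you are seeing is most likely the effect of a dynamic web page. Such pages have little or no static content for urllib2 or requests to fetch: the data is loaded by JavaScript in the browser after the initial HTML arrives, which is also why adding a delay before response.read() changes nothing. You can use Selenium for Python, which drives a real browser, to get the rendered content.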
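A minimal sketch of the Selenium approach, assuming Python 3, the selenium package (version 4 or later), and a local Chrome installation. The function name fetch_rendered_html and the headless flag are illustrative choices, not something the answer above specifies.

```python
def fetch_rendered_html(url):
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported lazily so that defining this function does not require Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # returns once the browser reports the page as loaded
        return driver.page_source  # the DOM serialized as HTML at this moment
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Example call (needs Chrome and network access, so it is left commented out):
# html = fetch_rendered_html("http://www.biography.com/people/michael-jordan-9358066")
```

Content that a page loads some time after the initial load event may still be missing from page_source; in that case an explicit wait for a specific element is usually needed.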

Related

Why do I keep getting None when doing web scraping in Python

This is the code that I wrote. I watched a lot of tutorials, and they get the expected output with exactly the same code.
import requests
from bs4 import BeautifulSoup as bs
url="https://shop.punamflutes.com/pages/5150194068881408"
page=requests.get(url).text
soup=bs(page,'lxml')
#print(soup)
tag=soup.find('div',class_="flex xs12")
print(tag)
I always get None. Also, the class name seems strange. "View source" shows different markup than "Inspect element" does.
bs4 is weird. Sometimes it returns different code than what is on the page; it alters the markup depending on the source. Try using Selenium. It works great and has many more uses than bs4. Most of all, it makes it very easy to find elements on a site.
It's not a bs4 problem; bs4 is correctly parsing what requests returns. It depends on the web page itself.
If you inspect the "soup", you will see that the source of the page is a set of links to scripts that render the content on the page. For these scripts to be executed, you need a browser: requests only gets you what the web server returns and won't execute the JavaScript for you. You can verify this yourself by disabling JavaScript in your browser's developer tools.
The solution is to use a web browser (e.g. headless Chrome + ChromeDriver) and Selenium to control it. There are plenty of good tutorials out there on how to do this.
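One rough way to check for this without a browser is to count the script tags and the visible text in the HTML that requests returned. The sketch below uses only the Python 3 standard library; the names ScriptVsTextCounter and summarize_html are made up for illustration, and the idea that "many scripts, almost no text" means a script-rendered page is a heuristic, not a rule.

```python
from html.parser import HTMLParser


class ScriptVsTextCounter(HTMLParser):
    """Count <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize_html(html):
    """Return (number of script tags, number of visible text characters)."""
    counter = ScriptVsTextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags, counter.visible_chars
```

Passing page (the requests.get(url).text value from the question) to summarize_html would show whether the response is mostly scripts with little text of its own.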

Any way to run the scripts in HTML that a browser runs automatically on page load?

I'm practicing parsing web pages with Python. What I do is
ans = requests.get(link)
Then I use re to extract some information from the HTML, which is stored in
ans.content
The problem I ran into is that some sites use scripts that are executed automatically in a browser, but not when I download the page using requests. For example, instead of getting a page with information, I get something like
scripts_to_get_info.run()
in the HTML code.
A browser is installed on my computer, and so is the program that I wrote, which means that, theoretically, I should have a way to run this script and get the information from within my Python code, and then parse it.
Is it possible? Any suggestions?
(The idea that this is doable came from the fact that when I inspected the page in Google Chrome, I saw the real HTML without any junk scripts.)

Unable to scrape data from dynamic page - Python Requests

I am trying to scrape the reports from this site. I load the home page, enter a report date, and hit submit. The page is Ajax-enabled, and I can't figure out how to get the report table. Any help would be really appreciated.
https://www.theice.com/marketdata/reports/176
I tried sending GET and POST requests using the requests module, but they failed with a session timeout or a report-not-available message.
EDIT:
Steps Taken so far:
URL = "theice.com/marketdata/reports/datawarehouse/..."
with requests.Session() as sess:
    f = sess.get(URL, params={'selectionForm': ''})  # Got 'selectionForm' by analyzing GET requests to URL
    data = {'criteria.ReportDate':--, ** few more params i got from hitting submit}
    f = sess.post(URL, data=data)
    f.text  # Session timeout / No Reports Found
Since you've already identified that the data you're looking to scrape is hidden behind some AJAX calls, you're already on your way to solving this problem.
At the moment, you're using python-requests for HTTP, and that is pretty much all it does. It does not execute JavaScript or anything else that involves scanning the content and running code in another language runtime. For that, you'll need something that drives a real browser, such as Selenium, to load the site, interact with the JavaScript, and then scrape the data you're looking for. (Mechanize is sometimes suggested for this, but it does not execute JavaScript either.)

Redirected to main page when trying to parse html with python

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "http://www.csgolounge.com/api/mathes"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
print (data)
I am trying to use this code to get the text from this page, but every time I try to scrape it, I am redirected to the home page, and my code outputs the HTML of the home page. The page I am trying to scrape is a .php file, not an HTML or text file. I would like to get the text from the page, then extract the data and do what I want with it.
I have tried changing the request headers so that the website would think I am a Chrome browser and not a bot, but I still get redirected to the home page. I have tried different Python HTML parsers, including BeautifulSoup, Python's built-in parser class, and several other popular ones, but they all give the same result.
Is there a way to stop this, and to get the text from this link? Is it a mistake in my code or what?
First of all, try it without the "www" part and over HTTPS.
Rewrite http://www.csgolounge.com/api/mathes as https://csgolounge.com/api/mathes
If that doesn't work, try Selenium.
requests may be getting stuck because it can't process the JavaScript part.
Selenium can handle it better.
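The URL rewrite suggested above can be done in code. This is a small sketch using the Python 3 standard library; the helper name to_https_without_www is made up for illustration, and whether the site then stops redirecting is not something this code can guarantee.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https_without_www(url):
    """Return `url` with the scheme forced to https and a leading 'www.' dropped."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[len("www."):]
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))
```

To confirm whether a redirect actually happened, compare the response's url attribute with the URL you requested, or look at response.history, which requests fills with the intermediate redirect responses.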

Getting Twitter news feed using Python

Hey everybody, I've been playing around with Python 2.7 and BeautifulSoup 3.2 recently, and I've gotten my code to work for Facebook: it makes a POST request to Facebook to log in, downloads the HTML of the page, and saves it to my computer. I tried doing the same with Twitter, but it doesn't seem to be working*. Here's my code:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
#I've replaced my actual username and password for obvious reasons
form = urllib.urlencode({'username':'myUsername','password':'myPassword'})
response = urllib2.urlopen('https://mobile.twitter.com/session',form)
response = response.read()
Can anyone tell me what's wrong with it? Thanks!
*After I do response = response.read(), I write it to a file on my hard drive and open it with Firefox. When I open it, all I see is whatever is on http://mobile.twitter.com/ at the time I run the script.
Don't use BeautifulSoup; use lxml instead: http://lxml.de/ (much more powerful, faster, and more convenient)
Don't scrape the Twitter web interface; use the official Twitter API instead: http://dev.twitter.com/doc/get/statuses/home_timeline
You may want to check out the Twitter libraries for Python.
