Python: urlopen not downloading the entire site

Greetings,
I have done:
import urllib
site = urllib.urlopen('http://www.weather.com/weather/today/Temple+TX+76504')
site_data = site.read()
site.close()
but the result doesn't match the source I see when the page is loaded in Firefox.
I suspected the user agent was the culprit and did this:
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8"

urllib._urlopener = AppURLopener()
and downloaded it, but it still doesn't download the whole website.
Can someone please help me do user agent switching, if that is the likely culprit?
Thanks,
Narnie

It's more likely that there is an iframe in the code or that JavaScript is modifying the DOM. If there's an iframe, you'll have to parse the page to get the URL of the iframe (see the sketch below), or just do it manually if it's a one-off. If it's JavaScript, I hear that selenium-rc is good, but I have no first-hand experience with it.
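As a rough illustration of the iframe case, here is a minimal sketch that finds any <iframe> elements and fetches their contents as well. It assumes requests and BeautifulSoup are available (neither is named in the answer above) and that the frames are fetchable directly:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.weather.com/weather/today/Temple+TX+76504'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Fetch every iframe's own document, resolving relative src attributes.
for frame in soup.find_all('iframe'):
    src = frame.get('src')
    if src:
        frame_url = urljoin(url, src)
        frame_html = requests.get(frame_url).text
        print(frame_url, len(frame_html))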

A downloaded page displayed locally may look different for several reasons, e.g. relative links (which can be fixed by adding something like <base href="http://www.weather.com/today/"> into the page's head element), or non-functional AJAX requests (see Ways to circumvent the same-origin policy). A sketch of the <base> fix follows.
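A small sketch of that <base> fix applied to a saved copy of the page; the file names and the plain string replacement are my own illustration and assume the document contains a bare <head> tag:

# Insert a <base> tag right after <head> so relative URLs in the saved copy
# resolve against the original site when the file is opened locally.
html = open('weather.html', encoding='utf-8').read()
html = html.replace('<head>', '<head><base href="http://www.weather.com/today/">', 1)
with open('weather_local.html', 'w', encoding='utf-8') as f:
    f.write(html)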

Related

How to complete geetest (captcha) when scraping with python-requests, while the request values are taken by solving the captcha manually?

I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a geetest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use Selenium (at least for now) and I'm trying to solve my problem with the Python module requests.
My idea was to complete the geetest myself and then send, from my program, the same request that my web browser sends after completing the slider.
To start, I scraped the HTML that the website returned with the captcha prompt:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built with info from the HTML code above. As you can see in the link above:
cid is initialCid, hsh is hash, etc.; one part of the link, cid, is a cookie that I got at the moment the captcha appeared.
I've seen there are services available that can solve captchas for you, but I decided to complete the captcha myself, copy the exact request (including cookies and headers) into my program, and then send that request with requests. For now I'm doing it by hand, but it doesn't work: the response is 403, whereas manually it's 200 and a redirect.
Here is a sample request that my browser sends after completing the captcha:
sample request
I'm sending it in program by:
s = requests.Session()
s.headers = headers                                  # headers copied from the browser's request
s.cookies.set(cookie_name, cookie_from_web_browser)  # requests' cookie jar needs both a name and a value
captcha = s.get(request_url)                         # the URL the browser requested after solving the captcha
Response is 403 and I have no idea how to make it work, help me.
Captchas are really tricky in the web-scraping world; most of the time you can bypass them by solving the captcha manually and then taking the returned cookie and plugging it into your script (see the sketch below). Depending on the website, the cookie could hold for 15 minutes, a day, or even longer.
The other alternative is to use captcha-solving services such as https://www.scraperapi.com/, where you pay a fee for x amount of requests, but you won't run into the captcha issue as they solve it for you.
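A minimal sketch of the cookie-reuse approach, assuming the protection cookie has been copied out of the browser after solving the captcha by hand; the cookie name 'datadome', its domain, and the target URL are illustrative assumptions, not details from the answer:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
})
# Value pasted from the browser's cookie store after the captcha was solved manually.
session.cookies.set('datadome', 'PASTE_COOKIE_VALUE_HERE', domain='.allegro.pl')

response = session.get('https://allegro.pl/')
print(response.status_code)  # should be 200 while the copied cookie is still valid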
Use a headers parameter to solve this problem, like so:
import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it against the web cache first, before running it with the real URL.

GET request returns a 403 status code even after using a header

I'm trying to scrape data from an Autotrader page. I managed to grab the link to every offer on that page, but when I try to get the data from each offer I get a 403 status, even though I'm using a header.
What more can I do to get past it?
headers = {"User Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/85.0.4183.121 Safari/537.36'}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code) # 403 forbidden
content_of_page = page.content
soup = bs4.BeautifulSoup(content_of_page, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text)
[For people in the same position: Autotrader uses Cloudflare to protect every "car-detail" page, so I would suggest using Selenium, for example; a sketch follows.]
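A minimal sketch of that Selenium route, assuming a local Chrome/chromedriver setup; the crude five-second wait and the h1 class (copied from the question's code) are assumptions that may need adjusting:

import time

import bs4
from selenium import webdriver

# Let a real browser render the Cloudflare-protected page, then parse the result.
driver = webdriver.Chrome()
driver.get("https://www.autotrader.co.uk/car-details/202010145012219")
time.sleep(5)  # crude wait for the page (and any JS challenge) to finish loading

soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text if title else "title not found")

driver.quit()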
If you can manage to get the data via your browser, i.e. you can somehow see this data on the website, then you can likely replicate that with requests.
Briefly, you need headers in your request to match the Browser's request:
Open dev tools in your browser (e.g. F12, or Cmd+Opt+I, or via the menu)
Open Network tab
Reload the page (the whole website or the target request's url only, whatever provides a desired response from the server)
Find the HTTP request to the desired URL in the Network tab. Right-click it, click 'Copy...', and choose the option you need (e.g. cURL).
Your browser sends tons of extra headers, and you never know which ones are actually checked by the server, so this technique will save you a lot of time (see the sketch below).
However, this might fail if there's some protection against blunt request copies, e.g. temporary tokens, so that the requests cannot be reused. In that case you need Selenium (browser emulation/automation); it's not difficult, so it's worth using.
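A minimal sketch of replaying a browser request with requests, using headers copied from the dev-tools Network tab; the URL and header values below are placeholders, not details from the answer:

import requests

# The request URL seen in the Network tab, and the headers copied verbatim from
# the browser's request (e.g. via "Copy as cURL"). Replace with your own values.
url = "https://example.com/api/target"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://example.com/",
    # ...any other headers the browser sent
}

response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text[:500])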

urllib getting HTML but missing data

Right, basically I'm fetching and displaying HTML; the data I'm looking for shows up just fine in a normal browser, but not in an HTML dump made with urllib.
Example URL: https://betfred.mobi/sports/horses/event/4315034.2
Example data: Horse names like "She Is No Lady"
Displays just fine under a browser. Doesn't need any login or preexisting cookies or anything.
I thought maybe it was waiting to see an actual user agent or something but that should be fine as well. I'm setting one and I've checked - it's working.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36')]
response = opener.open("https://betfred.mobi/sports/horses/event/4315034.2")
print response.read()
It's returning something all right, and I'm getting an HTML dump of the site, but the horse names, for example, are not showing up.
Am I missing something blindingly obvious here?
If you need to handle pages with JavaScript, try WATIR or Selenium - those drive a real web browser and can thus handle any JavaScript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
At present, Mechanize doesn't handle JavaScript.
Your other option would be understanding what the JavaScript on the offending page does and bypassing it manually, but that seems onerous.

scraping google news headlines

Google news is searchable by keyword and then that search can be narrowed down to a certain time period.
I tried doing the search on the website and then using the URL of the results page to reverse-engineer the search in Python, thus:
import urllib2
url = 'https://www.google.com/search?hl=en&gl=uk&tbm=nws&authuser=0&q=apple&oq=apple&gs_l=news-cc.3..43j0l9j43i53.5710.6848.0.7058.5.4.0.1.1.0.66.230.4.4.0...0.0...1ac.1.SRcIeXL5d48'
handler = urllib2.urlopen(url)
html = handler.read()
However, I get a 403 error. This method works with other websites, such as bbc.co.uk, so obviously Google does not want me to scrape its site with Python.
So I have two questions:
1) Is it possible to bypass this restriction Google has placed? If so, how?
2) Are there any other scrapeable news sites where I can search for news on a keyword for a given period?
For either of the options, I don't mind using a paid service, so such suggestions are welcome too.
Thanks in advance,
K.
Try setting the User-Agent:
req = urllib2.Request(url)  # url as defined in the question above
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)

Downloading an amazon.co.uk webpage using only Python, with the HTML exactly as Firebug sees it

I noticed that using urllib to download a webpage:
http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6
the content that I get back using urlopen(url).read() is different from what Firebug sees.
Example:
If you point Firebug at the page's image area, it tells you that a div with id="prodImageCell" exists; however, when looking through what Python has fetched, there is no such thing, so BeautifulSoup doesn't find anything.
Is this because the images are generated using javascript?
Question:
If so, is there a way of downloading pretty much the exact same thing Firebug sees using urllib (and not something like Selenium instead)?
I am trying to fetch the source URL of one of the images programmatically; for example, here the div with prodImageCell has src=http://ecx.images-amazon.com/images/I/51uPDvJGS3L.AA300.jpg, which is indeed the URL of the image.
Answer:
Can't answer properly because I don't have the reputation :(
Found the solution, thanks to #huelbois for pointing me in the right direction: one needs to use a User-Agent header.
Before
>>> import urllib2
>>> import re
>>> site = urllib2.urlopen('http://www.amazon.co.uk/\
Darkness-II-Limited-PC-DVD/dp/B005ULLEX6').read()
>>> re.search( 'prodImageCell', site )
>>>
After
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101\
Firefox/7.0.1"
>>> headers = {'User-Agent':user_agent}
>>> url = 'http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6'
>>> req = urllib2.Request(url=url, headers=headers)
>>> site = urllib2.urlopen(req).read()
>>> re.search( 'prodImageCell', site )
<_sre.SRE_Match object at 0x01487DB0>
hurrah!
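To go from merely matching 'prodImageCell' to actually pulling out the image URL the question asked about, a sketch like the following might work; the use of requests and BeautifulSoup (instead of urllib2 and re) and the assumed div/img structure are my additions, not part of the original answer:

import requests
from bs4 import BeautifulSoup

url = 'http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1'}
html = requests.get(url, headers=headers).text

# Look for the div the question mentions and read its image's src attribute.
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find(id='prodImageCell')
if cell is not None and cell.find('img') is not None:
    print(cell.find('img').get('src'))  # e.g. http://ecx.images-amazon.com/images/I/51uPDvJGS3L.AA300.jpg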
Just tested it right now with wget (it will work like urllib).
You HAVE to include a User-Agent header to get the requested part:
wget -O- --header='User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1' http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6
returns the html page with the requested part.
Oops: I just saw that you succeeded with my previous advice. Great!
