urllib getting HTML but missing data - python

Basically, the page I'm fetching displays the data I'm looking for just fine in a normal browser, but that data is missing from the HTML dump I get with urllib.
Example URL: https://betfred.mobi/sports/horses/event/4315034.2
Example data: Horse names like "She Is No Lady"
Displays just fine under a browser. Doesn't need any login or preexisting cookies or anything.
I thought maybe it was waiting to see an actual user agent or something but that should be fine as well. I'm setting one and I've checked - it's working.
import urllib2

# Build an opener with a browser-like User-Agent and dump the raw HTML.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36')]
response = opener.open("https://betfred.mobi/sports/horses/event/4315034.2")
print response.read()
It's returning something alright - I get an HTML dump of the site - but the horse names, for example, are not in it.
Am I missing something blindingly obvious here?

If you need to handle pages with Javascript, try WATIR or Selenium - those drive a real web browser, and can thus handle any Javascript. WATIR Classic requires either IE or Firefox with a certain extension installed, and you will see the pages flash on the screen as it works.
At present, Mechanize doesn't handle JavaScript.
Your other option would be understanding what the Javascript on the offending page does and bypassing it manually, but that seems onerous.
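For example, a minimal Selenium sketch for this case (assuming Firefox and geckodriver are installed; the URL is the one from the question) that loads the page, lets the JavaScript run, and then dumps the rendered HTML:
from selenium import webdriver
import time

# Drive a real browser so the page's JavaScript actually runs.
driver = webdriver.Firefox()   # assumes Firefox + geckodriver are installed
try:
    driver.get("https://betfred.mobi/sports/horses/event/4315034.2")
    time.sleep(5)  # crude wait for the JS to finish; a WebDriverWait on a known element is more robust
    html = driver.page_source   # the DOM after JavaScript has run, unlike a raw urllib dump
    print(html)
finally:
    driver.quit()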

Related

Python requests.get only responds if I don't specify page number

I am scraping web data with python using requests and beautiful soup. I have found that 2 of the websites I am scraping from only respond if I do not specify the page number.
The following code works and allows me to extract the data needed:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer', headers = headers)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'class':'col-xs-12 job-results clearfix'})
If however I change the link to specify a page number, such as:
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers = headers)
then the request never completes. There is no error code; the console just waits indefinitely. What is causing this, and how do I resolve it?
EDIT: I opened the site in Incognito manually. It seems that when opening with the page number I get an "access denied" response, but if I refresh the page it lets me in?
That's because the paginated URLs aren't accessible from outside the site. If you are logged in and have a session cookie of some kind, add it to your headers.
Also, from what I can see on the site, you may be requesting the wrong URI - there are no page numbers there. Did you add ?page= yourself?
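As a minimal sketch of that idea, you could pass a cookie copied from your logged-in browser session along with the request (the cookie name and value below are placeholders - copy the real Cookie header from your browser's dev tools):
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)',
    # Placeholder cookie - replace with whatever your browser actually sends.
    'cookie': 'sessionid=REPLACE_WITH_YOUR_SESSION_COOKIE',
}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers=headers)
print(r.status_code)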
This is a common web-scraping problem. In your case, the page blocks the request because the header declaration lacks a proper user-agent definition.
To get it to work, include a user-agent declaration like this:
headers={'user-agent':'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',}
You can dive more deeply into the problem of writing good web scrapers here:
https://towardsdatascience.com/5-strategies-to-write-unblock-able-web-scrapers-in-python-5e40c147bdaf
A list of proper user-agents can be found here:
https://webscraping.com/blog/User-agents/
Hope this gets you unblocked.
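Putting that together with your original code, a sketch of the request against the paginated URL might look like this (the user-agent string is just an example; any realistic browser string should do):
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3'}
r = requests.get('https://www.milkround.com/jobs/graduate-software-engineer?page=2', headers=headers, timeout=30)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs={'class': 'col-xs-12 job-results clearfix'})
print(table)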

getting past ReadTimeout from Python Requests

I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works fine. If I navigate to the site manually in any browser, it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but that didn't help. As far as I can tell it's not a matter of sending too many requests in a given time period (Selenium and manual browsing still work at any time), so I'm confident this specific issue is NOT a rate limitation.
my code:
from random import choice
import requests
import traceback
list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']
def random_headers():
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

myheaders = random_headers()
response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=myheaders,
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive my response? I would prefer to use requests, but anything that runs fast (unlike Selenium) will do. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' but rather 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the User-Agent. The reason Selenium works and requests doesn't is that Selenium drives a real browser, so the traffic looks more human, while a bare requests call is much easier to detect as a script. Note that, judging from Home Depot's robots.txt, product pages don't appear to be allowed for scraping. That said, I got a response using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
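If a single user-agent is not enough, here is a sketch of a slightly more "humanized" request - extra browser-like headers plus a simple retry with backoff. The header values are plausible examples, not anything Home Depot specifically requires:
import time
from random import choice
import requests

user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
]

def humanized_headers():
    # Browser-like headers; rotate the user-agent on every call.
    return {
        'User-Agent': choice(user_agents),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

url = 'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629'
for attempt in range(3):
    try:
        response = requests.get(url, headers=humanized_headers(), timeout=10)
        print(response.status_code)
        break
    except requests.exceptions.ReadTimeout:
        # Back off briefly before retrying instead of hammering the site.
        time.sleep(2 * (attempt + 1))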

python - login a page and get cookies

First, thanks for taking the time to read this and maybe trying to help me.
I am trying to make a script to easily log in to a site. I also wanted to get the login cookies, so I could maybe reuse them later. I made the script and it logs me in correctly, but I cannot get the cookies. When I try to print them, I see just this:
<RequestsCookieJar[]>
Obviously that doesn't help me much. So I would like to know how to get the real cookie data. Thanks a lot to whoever can help me with that.
My code:
import requests
import cfscrape
from bs4 import BeautifulSoup as bs
header = {"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
s = requests.session()
scraper=cfscrape.create_scraper(sess=s) #i use cfscrape because the page uses cloudflare anti ddos page
scraper.get("https://www.bstn.com/einloggen", headers=header)
myacc={"login[email]": "my#mail.com", #obviously change
"login[password]": "password123"}
entry=scraper.post("https://www.bstn.com/einloggen", headers=header, data=myacc)
soup=bs(entry.text, 'lxml')
accnm=soup.find('meta',{'property':'og:title'})['content']
print("Logged in as: " + accnm)
aaaa=scraper.get("https://www.bstn.com/kundenkonto", headers=header)
print(aaaa.cookies)
If I print the cookies, I just get the <RequestsCookieJar[]> described earlier... It would be really nice if anyone could help me get the "real" cookies.
To get the login cookie you should look at the response of the POST, because that is the request that performs the login - the server sends back session cookies if the email and password are correct. The reason you got empty cookies in aaaa is that the site didn't set or change any cookies on that particular response.
entry = scraper.post("https://www.bstn.com/einloggen", allow_redirects=False, headers=header, data=myacc)
print(entry.cookies)
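As a further sketch, assuming the shared session s from your code keeps accumulating cookies across requests, you can also inspect everything the session has collected so far rather than only the cookies set by a single response:
import requests

# Cookies accumulated by the session across all requests made through it.
print(s.cookies)
print(s.cookies.get_dict())  # as a plain dict, handy for saving and reusing later
# equivalently:
print(requests.utils.dict_from_cookiejar(s.cookies))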

How do I know which browser is used to crawl in Scrapy framework?

What's my context:
As you know, a website's HTML structure can differ between Chrome, Firefox, and Safari. So when I use a CSS selector to get data from an element, sometimes the tag exists in the page as Chrome renders it but not in the other browsers. Because of that, I want to focus on only one browser to reduce my effort.
When I crawl data from URLs using the Scrapy framework, I don't know which browser Scrapy uses, so I also don't know what kind of HTML response body will be returned. I checked the responses and found that sometimes the structure matches what I get from Chrome and sometimes it doesn't. It seems as if Scrapy were using several different web browsers to crawl the data.
What I want:
I want to use only Chrome browser for crawling data in Scrapy framework
The structure of the HTML response body must be obtained from Chrome
What I ask:
Does anyone have any Ideas or tips to help me deal with that issue?
Can I configure a WebDriver in the Scrapy framework, the way Selenium does? (If that's possible, please show me where and how.)
Thank you!
Scrapy does not use a browser at all - it parses static HTML, much like BeautifulSoup. If you want to parse a dynamic (JavaScript-generated) page, use Selenium; you can then hand the rendered page source back to Scrapy if you like.
To make Scrapy send a custom (Chrome) user agent, add this to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
or in my_spider.py:
class MySpider(scrapy.Spider):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers={"User-Agent": "Your Custom User Agent"})
You can also set the user agent in your settings file, something like this:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
To the web server, the request will then look as if it came from Chrome.
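If you only want this for a single spider rather than the whole project, a sketch using Scrapy's per-spider custom_settings (the user-agent string and URL here are just examples):
import scrapy

class ChromeLikeSpider(scrapy.Spider):
    name = "chrome_like"
    start_urls = ["https://example.com"]
    # Per-spider override of the project-wide USER_AGENT setting.
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    }

    def parse(self, response):
        # The HTML here is still the raw server response - no JavaScript has run.
        self.log(response.css("title::text").get())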

Python: urlopen not downloading the entire site

Greetings,
I have done:
import urllib
site = urllib.urlopen('http://www.weather.com/weather/today/Temple+TX+76504')
site_data = site.read()
site.close()
but the result doesn't match the source I see when the page is loaded in Firefox.
I suspected the user agent and did this:
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8"

urllib._urlopener = AppURLopener()
and downloaded it, but it still doesn't download the whole website.
Can someone please help me do user agent switching, if that is the likely culprit?
Thanks,
Narnie
It's more likely that there is an iframe in the page or that JavaScript is modifying the DOM. If there's an iframe, you'll have to parse the page to get the URL of the iframe (or just grab it manually if it's a one-off). If it's JavaScript, I hear that selenium-rc is good, but I have no first-hand experience with it.
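For the iframe case, here is a minimal sketch (using requests and BeautifulSoup rather than the urllib call above, and assuming the page really does embed its content in frames) that pulls the iframe URLs out of the downloaded HTML and fetches them too:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.weather.com/weather/today/Temple+TX+76504'
html = requests.get(url).text

soup = BeautifulSoup(html, 'html.parser')
for iframe in soup.find_all('iframe'):
    src = iframe.get('src')
    if src:
        # Resolve relative frame URLs against the parent page before fetching.
        frame_url = urljoin(url, src)
        frame_html = requests.get(frame_url).text
        print(frame_url, len(frame_html))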
A downloaded page displayed locally may also look different for several reasons: relative links, for example (which can be fixed by adding e.g. <base href="http://www.weather.com/today/"> into the page's head element), or non-functional AJAX requests (see Ways to circumvent the same-origin policy).
