I am trying to write a Python parser for this website: 'http://www.topuniversities.com/university-rankings/world-university-rankings/2015#sorting=rank+region=+country=+faculty=+stars=false+search='
Every time I do a regular urlopen and print the result, it says
'Access denied | www.topuniversities.com used CloudFlare to restrict access'.
After that, I tried this approach:
from urllib import FancyURLopener  # Python 2, matching the print statement below

class MyOpener(FancyURLopener):
    # Present a browser User-Agent instead of the default Python one
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'

url = 'http://www.topuniversities.com/university-rankings/world-university-rankings/2015#sorting=rank+region=+country=+faculty=+stars=false+search='
myopener = MyOpener()
page = myopener.open(url).read()
print page
But this prints out something other than what Chrome's Inspect Element shows. I need to parse the names of the universities, their rankings, and the URL that leads to each university's page.
What do I do? Please help.
Related
I am trying to use requests to make the API call that this page makes: https://www.betonline.ag/sportsbook/martial-arts/mma
import requests

requests.post(
    url='https://api.betonline.ag/offering/api/offering/sports/offering-by-league',
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'},  # comma was missing here
    json={"Sport": "martial-arts", "League": "mma", "ScheduleText": None, "Period": 0},
)
I have also tried including all the headers I see in the image, but I still cannot get a 200 response.
What am I missing?
I am trying to scrape the website https://www.nseindia.com/ in Python.
However, when I try to load the site using requests, the call simply hangs. Below is the code I am using.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.nseindia.com/',headers=headers)
The requests.get call simply hangs, and I am not sure what I am doing wrong. The same URL works perfectly in Chrome or any other browser.
Appreciate any help.
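One thing worth ruling out first: requests applies no timeout unless one is passed, so a server that accepts the connection but never answers can block the call indefinitely. The sketch below is only a diagnostic aid, not a fix for whatever the site is doing. The helper name fetch and the timeout values are assumptions chosen for illustration.

```python
import requests

def fetch(url, headers=None, timeout=(5, 15)):
    """Return the response, or None if the request fails or times out.

    timeout is (connect seconds, read seconds); both values are assumptions.
    """
    try:
        return requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # RequestException covers Timeout and ConnectionError, among others
        print("request failed: %r" % exc)
        return None
```

Calling fetch('https://www.nseindia.com/', headers=headers) then either returns a response or reports which exception occurred, instead of waiting forever.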
I am trying to scrape some data from eBay.de using a proxy located in Germany. I tried different webpages to double-check it.
import mechanicalsoup
proxies = {"http": "http://.....",
           "https": "https://...."}
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies
browser.set_user_agent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
browser.open('https://www.ebay.de/sch/internet.shop.world/m.html?_nkw=&_armrs=1&_ipg=&_from=/de')
browser.launch_browser()
If I run this code without a VPN but with the proxy (my own IP address is located outside Germany), I get just one article.
If I try the same with a German VPN server and without the proxy, I get a lot more articles. Is there something about a VPN server that makes eBay more convinced the user is in Germany than a proxy does?
The timezone is correct with the proxy.
Try setting the Accept-Language header to your language:
headers = {
    # 'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6,ms;q=0.4',
    'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
}
# Create the browser first; set_user_agent needs an existing instance
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
browser.session.proxies = {}
browser.open('https://www.ebay.de/sch/internet.shop.world/m.html?_nkw=&_armrs=1&_ipg=&_from=/de', headers=headers)
browser.launch_browser()
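To check which Accept-Language value actually goes out, the header can be set on the session and inspected on a prepared request without sending anything. This is a minimal sketch using a plain requests.Session (MechanicalSoup exposes one as browser.session). The German language string is an assumption for the eBay.de case, not something taken from the question.

```python
import requests

session = requests.Session()
# Assumption: German is the language the site should see
session.headers.update({"Accept-Language": "de-DE,de;q=0.9,en;q=0.5"})

# Prepare the request without sending it, so the outgoing headers can be inspected
prepared = session.prepare_request(requests.Request("GET", "https://www.ebay.de/"))
print(prepared.headers["Accept-Language"])
```

Setting the header on the session applies it to every later request, so it does not have to be passed to each open call.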
This is the first time I am trying requests.post() because I have always used requests.get(). So I'm trying to navigate to a website and search. I am using yellowpages.com, and before I get negative feedback about using the site to scrape or about an API, I just want to try it out. The problem I am running into is that it spits out some html that isn't remotely what I am looking for. I'll post my code below to show you what I am talking about.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://www.yellowpages.com"
search_terms = "Car Dealership"
location = "Jackson, MS"
q = {'search_terms': search_terms, 'geo_locations_terms': location}
page = requests.post(url, headers=headers, params=q)
print(page.text)
Your request boils down to
$ curl -X POST \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36' \
'https://www.yellowpages.com/?search_terms=Car Dealership&geo_locations_terms=Jackson, MS'
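For this, the server returns a 502 Bad Gateway status code.
The reason is that you use POST together with query parameters (params). The two don't go well together: params appends the values to the URL as a query string, whereas data sends them form-encoded in the request body. Use data instead: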
requests.post(url, headers=headers, data=q)
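The difference between the two keyword arguments can be seen by preparing both requests without sending them. This is a small sketch that reuses the values from the question; the variable names with_params and with_data are made up for the illustration.

```python
import requests

url = "https://www.yellowpages.com"
q = {"search_terms": "Car Dealership", "geo_locations_terms": "Jackson, MS"}

# Build the requests without sending them, so nothing touches the network
with_params = requests.Request("POST", url, params=q).prepare()
with_data = requests.Request("POST", url, data=q).prepare()

print(with_params.url)   # values are appended to the URL as a query string
print(with_params.body)  # no request body
print(with_data.url)     # URL carries no query string
print(with_data.body)    # values are form-encoded in the body
```

With data, requests also sets the Content-Type header to application/x-www-form-urlencoded, which is what an HTML form POST sends.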
If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine. But when I try to access it using python requests, I get a 403 error.
blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accepts': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
Because it's hosted by WPEngine and they filter user agents.
Try this:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
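For context, requests identifies itself with a python-requests User-Agent by default, which is the kind of value a user-agent filter can match on. The sketch below prints that default and confirms the override on a prepared request, without contacting the server. That the host filters on this exact string is the answer's claim and is assumed here, not verified.

```python
import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"

# What requests sends when no User-Agent is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)

session = requests.Session()
session.headers["User-Agent"] = USER_AGENT

# Prepare the request without sending it to inspect the outgoing header
prepared = session.prepare_request(
    requests.Request("GET", "http://boxinsider.cratejoy.com/feed/"))
print(prepared.headers["User-Agent"])
```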