Geolocation leak with MechanicalSoup - Python

I am trying to scrape some data from eBay.de using a proxy located in Germany. I checked the proxy's location against several geolocation pages to confirm it.
import mechanicalsoup

# proxy located in Germany (host/credentials redacted)
proxies = {
    "http": "http://.....",
    "https": "https://....",
}

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies
browser.set_user_agent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
)
browser.open('https://www.ebay.de/sch/internet.shop.world/m.html?_nkw=&_armrs=1&_ipg=&_from=/de')
browser.launch_browser()
If I run this code without a VPN but with the proxy (my own IP address is outside Germany), I get only one article. If I run the same thing through a German VPN server and without the proxy, I get a lot more articles. Is there anything that makes eBay trust a VPN server more than a proxy as evidence that the user is in Germany? The timezone reported with the proxy is correct.

Try setting an Accept-Language header for the locale you want to appear to be browsing from (German for ebay.de):
headers = {
    # 'accept-language': 'en-GB,en-US;q=0.8,en;q=0.6,ms;q=0.4',
    # 'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
    'accept-language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',  # German locale for ebay.de
}

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies  # keep the German proxy from the question
browser.set_user_agent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
)
browser.open('https://www.ebay.de/sch/internet.shop.world/m.html?_nkw=&_armrs=1&_ipg=&_from=/de',
             headers=headers)
browser.launch_browser()
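As a sanity check, it can also help to confirm exactly what the target sees through the proxied session. A rough sketch, reusing the proxies dictionary from above and httpbin.org as a neutral echo service (the endpoint choice and header value are only illustrative):
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = proxies  # same proxy settings as above
# send the locale header on every request instead of per call
browser.session.headers.update({'Accept-Language': 'de-DE,de;q=0.9'})

# httpbin echoes the IP and headers the server actually receives,
# so you can confirm the proxy's German IP and the language header
print(browser.session.get('https://httpbin.org/ip').text)
print(browser.session.get('https://httpbin.org/headers').text)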

Related

Is there a way to find a browser's user-agent with cmd or python?

I am writing code that uses a headless browser, but to access a specific website I need to send a user-agent as well. I am currently doing it with the following snippet (Python/Selenium/ChromeDriver).
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")
opts.add_argument("--no-sandbox")
opts.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
But I want the user-agent to be genuine rather than the same on every browser/device the code runs on, so I need to know the user-agent of the browser on the user's device. Is there any way to find a browser's user-agent using Python/Selenium code or the command prompt?
httpagentparser extracts the OS, browser, and other details from an HTTP user-agent string, so try this:
import httpagentparser as agent

# pass the bare user-agent string (without the "user-agent=" option prefix)
s = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
print(agent.detect(s))
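If the goal is to pick up the genuine user-agent of the browser installed on the user's machine rather than a hard-coded string, one possible approach (a sketch, assuming a local Chrome/chromedriver setup) is to launch the real browser once, read navigator.userAgent through Selenium, and reuse that value for the headless session:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# launch the user's real (non-headless) Chrome once to read its genuine user-agent
driver = webdriver.Chrome()
real_ua = driver.execute_script("return navigator.userAgent")
driver.quit()
print(real_ua)

# reuse the detected string for the headless run
opts = Options()
opts.add_argument("--headless")
opts.add_argument("--no-sandbox")
opts.add_argument(f"user-agent={real_ua}")
driver = webdriver.Chrome(options=opts)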

Error when web scraping is run from my personal server with Python

I'm brand new to coding and have finished making a simple program to web scrape some stock websites for particular data. The simplified code looks like this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Personal_User_Agent'}
fv = "https://finviz.com/quote.ashx?t=JAGX"
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
The website would not respond until I added a user agent, but then it worked fine. I then served the page from Python's local host, which also worked, so I thought I was ready to make the site public via PythonAnywhere.
However, once the site was public, the program shut down every time it tried to fetch data by web scraping (i.e. using the user_agent). I didn't like the idea of using my own user agent on a public domain, but I couldn't find out how other people who web scrape handle this when a user agent is required. Any advice?
I would rotate through some random user agents rather than use my own. Something like this should work:
import random
import requests
from bs4 import BeautifulSoup

header_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_18_3) AppleWebKit/537.34 (KHTML, like Gecko) Chrome/82.0.412.92 Safari/539.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 11.12; rv:87.0) Gecko/20170102 Firefox/78.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.32 (KHTML, like Gecko) Chrome/82.0.12.17 Safari/535.42',
]

fv = "https://finviz.com/quote.ashx?t=JAGX"
user_agent = random.choice(header_list)
headers = {'User-Agent': user_agent}
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
Or, option 2: use a library called fake-headers to generate them on the fly:
from fake_headers import Headers
import requests
from bs4 import BeautifulSoup

fv = "https://finviz.com/quote.ashx?t=JAGX"
headers = Headers(os="mac", headers=True).generate()
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
Really depends on whether you want to use a library or not...
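If the scraper loops over several tickers, the same idea works per request; a quick sketch (the ticker list here is purely hypothetical) that picks a fresh user-agent from header_list on every iteration:
import random
import requests
from bs4 import BeautifulSoup

tickers = ['JAGX', 'AAPL', 'TSLA']  # hypothetical tickers, for illustration only

for ticker in tickers:
    headers = {'User-Agent': random.choice(header_list)}  # header_list from option 1
    r = requests.get(f"https://finviz.com/quote.ashx?t={ticker}", headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(ticker, soup.find('title'))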

403 Forbidden error: can't access this site

I want to scrape https://health.usnews.com/doctors/specialists-index, but when I send a request to this site through a Scrapy spider it returns status code 403. I added a user_agent to my request, but it still doesn't work.
I referred to these two answers, Python Doesn't Have Permission To Access On This Server / Return City/State from ZIP and 403: You don't have permission to access /index.php on this server, but they didn't work for me.
My user_agent is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36. Can someone help me scrape the site mentioned above?
Try adding 'authority' to the headers as well. The following works for me in scrapy shell:
from scrapy import Request

headers = {
    'authority': 'health.usnews.com',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
url = "https://health.usnews.com/doctors/specialists-index"
req = Request(url, headers=headers)
fetch(req)  # fetch() is available inside scrapy shell
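The same headers can be carried over into the spider itself; here is a minimal sketch (the spider name and the parse logic are placeholders, not part of the original answer):
import scrapy

class SpecialistsSpider(scrapy.Spider):
    name = "specialists"  # placeholder name

    headers = {
        'authority': 'health.usnews.com',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    }

    def start_requests(self):
        # attach the headers to the initial request instead of relying on defaults
        yield scrapy.Request(
            "https://health.usnews.com/doctors/specialists-index",
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        self.log(response.status)  # replace with real extraction logic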

Python Requests Get XML

If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine, but when I try to access it using Python requests, I get a 403 error.
import requests

blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accept': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
Because it's hosted by WPEngine and they filter user agents.
Try this:
import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
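Once the request succeeds, the feed body can be parsed with the standard library; a rough sketch, assuming the response is an ordinary RSS feed:
import xml.etree.ElementTree as ET
import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
resp = requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
resp.raise_for_status()  # surface any remaining 4xx/5xx instead of parsing an error page

root = ET.fromstring(resp.content)
# print each item's title, assuming the standard <rss><channel><item><title> structure
for title in root.findall('./channel/item/title'):
    print(title.text)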

Python requests vs. robots.txt

I have a script meant for personal use that scrapes some websites for information. Until recently it worked just fine, but it seems one of the websites beefed up its security and I can no longer access its contents.
I'm using python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:
'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html
Is there any way I can convince the website that I'm not a malicious robot? This is a script I run about once per day against a single page, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)
EDIT: I tried switching to mechanize and ignoring robots.txt that way, but now I'm getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to plan B and stop using the website, unless anyone has other ideas.
What is most likely happening is that the server is checking the user-agent and denying access to the default user-agent used by bots.
For example, requests sets the user-agent to something like python-requests/2.9.1.
You can specify the headers yourself:
import random
import requests

url = "https://google.com"
UAS = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
)

ua = random.choice(UAS)  # pick one user-agent at random
headers = {'user-agent': ua}
r = requests.get(url, headers=headers)
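Since the script runs daily, another option is to set the header once on a requests.Session so every call in the run reuses it (and the underlying connection). A small sketch building on the UAS tuple above:
import random
import requests

session = requests.Session()
session.headers.update({'user-agent': random.choice(UAS)})  # UAS from the snippet above

# every request made through this session now carries the chosen user-agent
r = session.get("https://google.com")
print(r.status_code)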
