fake geolocation with scrapy crawler - python

I am trying to scrape a website which serves a different page depending on the geolocation of the IP sending the request. I am using an Amazon EC2 instance located in the US (which means it is served the page meant for the US), but I want the page that would be served in India. Does Scrapy provide a way to work around this somehow?

If the site you are scraping does IP-based detection, your only option is to change your IP somehow. This means either using a different server (I don't believe EC2 operates in India) or proxying your requests. Perhaps you can find an Indian proxy service?
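If you do find an Indian proxy, a small Scrapy downloader middleware can route every request through it. A minimal sketch, assuming you have proxy endpoints in India (the addresses and credentials below are placeholders, not real endpoints):

```python
import random

# Placeholder endpoints -- substitute proxies from a real Indian proxy provider
INDIAN_PROXIES = [
    "http://user:pass@in-proxy1.example.com:8000",
    "http://user:pass@in-proxy2.example.com:8000",
]

class IndianProxyMiddleware:
    """Scrapy downloader middleware: setting request.meta['proxy'] makes
    Scrapy's built-in HttpProxyMiddleware send the request through a
    randomly chosen proxy from the pool."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(INDIAN_PROXIES)
```

You would then enable it in settings.py via DOWNLOADER_MIDDLEWARES (the module path 'myproject.middlewares.IndianProxyMiddleware' is hypothetical and depends on where you put the class).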

Related

My Flask python web application is not accessible from a domain name

I have written a web application in Python using Flask. I decided to deploy it from my home on a Raspberry Pi, which runs the code continuously. I would like it to be accessible from a web browser, so I configured my router to redirect requests to my server, and I took care to configure my firewall accordingly.

The application works well and is perfectly accessible by typing its public IP in my URL bar. The problem is that I can't access it using a domain name. I rented a new one and configured the DNS records so that it points to my server. I tested the DNS servers and the pointing seems to be effective. However, when I enter the domain name I don't get my web application but the page of Hostinger, the company where I rented the domain name. I contacted their technical department and they assure me that the problem is not in the DNS but in the hosting, and therefore in my Python code. This leaves me perplexed, because my web application is accessible from its public IP, so the code should be good.
Please do not hesitate to ask me for additional details, either about my Python program or about my server.
Thanks in advance for your help.

how to ip proxy in django route

There are websites like "XYZ.com" and "ABC.com" that are only accessible from a certain range of IP addresses.
We have a public IP, and through it those third-party websites are accessible (only from within our office premises).
I have to develop a website (djangoproxy.com) in Django so that I can access those third-party websites from outside that public IP range.
So I plan to access those websites like:
XYZ.djangoproxy.com,
ABC.djangoproxy.com
There is one condition: access is restricted to authenticated users. So I have to write code on djangoproxy.com to authenticate the user, and after a successful login open the third-party website in the same browser tab.
I have checked some Python packages for VPNs:
https://gist.github.com/Lazza/bbc15561b65c16db8ca8
Reference:
Python requests, change IP address
Can you guide me on whether this functionality can be developed using Python code or any web-server configuration?
I am using NGINX as the web server hosting djangoproxy.com.
You can use django-revproxy for your purpose. Your Django server will sit between the client and the external website; you can add Django's authentication logic and allow proxying only for authenticated clients.
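A minimal sketch of what that could look like with django-revproxy (the upstream URL and URLconf are illustrative; check the package's documentation for the exact API and options):

```python
# views.py -- requires Django and django-revproxy to be installed
from django.contrib.auth.decorators import login_required
from django.utils.decorators import method_decorator
from revproxy.views import ProxyView

@method_decorator(login_required, name="dispatch")
class XYZProxyView(ProxyView):
    # Every request hitting this view is forwarded to the upstream site;
    # login_required ensures only authenticated users get through.
    upstream = "https://XYZ.com/"

# urls.py (sketch):
# from django.urls import re_path
# urlpatterns = [
#     re_path(r"^(?P<path>.*)$", XYZProxyView.as_view()),
# ]
```

For the subdomain scheme (XYZ.djangoproxy.com, ABC.djangoproxy.com), you would point each subdomain at the same Django app in NGINX and select the upstream per host.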

Open blocked site for scraping

I need to scrape a few details from a website. The problem is that this particular website is banned in India, so I cannot open the site without a VPN, but the VPN makes scraping a lot slower and the program crashes a lot because the response time of the site increases. Is there any other way I can access the website?
Try this method: it is a private DNS setting that lets you access blocked websites, and it is faster than a VPN.
Works only on Chrome:
Go to Chrome Settings.
Click on Security.
Under secure DNS, select Cloudflare (1.1.1.1).
For more details: https://asapguide.com/open-blocked-websites-without-vpn/
You can use ScraperAPI (https://www.scraperapi.com/), which provides you with a dynamic IP. It supports all languages; you only need to prepend the ScraperAPI endpoint and pass your own URL as a parameter.
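The pattern looks roughly like this; the api_key value is a placeholder and you should check ScraperAPI's documentation for the current parameter names:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder -- your real key goes here
target_url = "https://example.com/page-to-scrape"

# ScraperAPI acts as an HTTP endpoint: you request their API and pass
# the real target URL as a query parameter; they fetch it for you
# through their own rotating IPs.
api_url = "http://api.scraperapi.com/?" + urlencode(
    {"api_key": API_KEY, "url": target_url}
)

# The page can then be fetched with any HTTP client, e.g.:
# html = urllib.request.urlopen(api_url).read()
```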

How can I pretend to be in a certain country during web scraping?

I want to scrape a website, but it should look like I am from a specific country (let's say the USA for this example), to make sure that my results are valid.
I am working in Python (Scrapy), and for scraping I am using rotating user agents (see: https://pypi.org/project/scrapy-fake-useragent-fix/).
The rotating user agents are working for my scraping, but can I use something similar, in combination with the request, to pretend that I am in a specific country?
If there are some possibilities (in Scrapy, Python) please let me know. Appreciated!
Example of how I used the user agents in my script:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
To pretend to be in a certain country you need an IP from that country. Unfortunately this is not something you can configure just through Scrapy settings. But you could use a proxy service like Crawlera:
https://support.scrapinghub.com/support/solutions/articles/22000188398-restricting-crawlera-ips-to-a-specific-region
Note: unfortunately this service is not free, and the cheapest plan is about 25 EUR; there are many other, cheaper services available. The reason Crawlera is expensive is that it offers ban detection and only serves good IPs for your chosen domain. I've found it worth the cost on Amazon and Google, though on lesser domains a cheaper service with unlimited traffic would be more suitable.
You can do this using Selenium (I don't know about Scrapy). First tell the bot to go to a proxy site:
Proxy Site
Then enter your target site into its search box and scrape.
Hello #helloworld1990,
Based on your requirement: if you want to make each request using a different IP, i.e. use IP rotation (needed when the site detects and blocks you after a certain number of requests), then go for "proxy providers"; there are many such providers, you just need to google them.
If that's not the case, then for short-term use you can try Tor IPs. But Tor exit nodes are well known and are generally blocked. Otherwise, you can buy a few static IPs from proxy providers and make the requests through them.
if unique_ip_for_each_request_from_different_geolocations:
    # go for proxy providers - IP rotation
else:
    if short_term_use:
        # go for Tor nodes
    else:
        # go for static IPs
Cheers! Hope this helps..

how to hide domain name?

I have developed a web interface for a system in django, which is running on my institution server (abc.edu). So the web address for the interface is http://def.abc.edu:8000/mysystem.
I am going to submit a paper about the system to a double-blind conference (reviewers should not know which institution I am from). So I cannot put the link http://def.abc.edu:8000/mysystem in my paper; I have to hide the domain name. Is there a way to do that in Django, or in any other way? Any help will be appreciated.
As stated in the comments, this is not done in Django but at the DNS level. The reason is simple: when you type an address in the URL bar of your browser, the browser asks a DNS server which IP address the domain corresponds to, and Django (or any other web framework) is oblivious to this. Changing your address in Django would only change the URL on links, which would then become invalid.
Providing the IP of your server directly, as stated in the comments, won't provide any protection either, because university IP address ranges are well known; finding which university a given IP belongs to is easy.
The easiest way to achieve what you need is to get (for free or for a fee) a domain name which redirects to your address. Dyndns.org, noip.com, and similar DNS service providers also give you features such as embedding your website in a frame to hide its address from the URL bar. Most of these tricks are pretty easy to defeat, though, allowing the origin URL or address to be discovered.
You may also host your project on another server, outside your university. Depending on the requirements of your web interface, some hosts may host it for free.