Open blocked site for scraping - python

I need to scrape a few details from a website, but the site is banned in India and I cannot open it without a VPN. The VPN makes scraping a lot slower, and the program crashes often because the site's response time increases. Is there any other way I can access the website?

Try this method: it's a private DNS setting that lets you access blocked websites, and it is faster and works better than a VPN.
Works only on Chrome:
Go to Chrome Settings.
Click on Security.
Under the secure DNS option, select Cloudflare (1.1.1.1).
For more details: https://asapguide.com/open-blocked-websites-without-vpn/

You can use ScraperAPI (https://www.scraperapi.com/), which provides you with a dynamic IP. It works with any language; you only need to put the ScraperAPI URL at the beginning and pass your own URL as a parameter.
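A rough sketch of that pattern with Python's requests library might look like this (the API key is a placeholder, and the exact endpoint and parameter names should be checked against ScraperAPI's documentation):

import requests

# hypothetical key and target URL - check ScraperAPI's docs for the exact parameters
API_KEY = "YOUR_SCRAPERAPI_KEY"
target_url = "https://example.com/page-to-scrape"

payload = {"api_key": API_KEY, "url": target_url}
response = requests.get("http://api.scraperapi.com/", params=payload, timeout=60)

print(response.status_code)
print(response.text[:500])  # first 500 characters of the returned HTML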

Related

Selenium Python get data from HTTP request

I am running automation with Selenium and Python on the Opera web driver. When I enter the specific page that I need, a request is sent to the server; it is authenticated with anti-content, which blocks me from requesting it myself, so the only solution is to get the JSON returned after the request is sent. I have checked selenium-wire, but I don't think it fits my needs. Is there another way to do this? Any suggestions?
You can try Titanium Web Proxy. It is a proxy server that can be installed via its NuGet package and used together with Selenium.
// inside the proxy's response handler, read the intercepted body:
string body = await e.GetResponseBodyAsString();
Reference:
https://github.com/justcoding121/Titanium-Web-Proxy/issues/176
https://www.automatetheplanet.com/webdriver-capture-modify-http-traffic/#tab-con-9
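If you want to stay in Python, the same interception idea can be sketched with selenium-wire (mentioned in the question). This is only an illustration; the page URL and the endpoint filter are placeholders, and the response body may still need decompressing depending on its Content-Encoding:

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://example.com/the-specific-page")  # placeholder URL

# look through the captured traffic and pull the JSON body of the call you care about
for request in driver.requests:
    if request.response and "/api/endpoint" in request.url:  # placeholder filter
        body = request.response.body.decode("utf-8", errors="replace")
        print(body)

driver.quit()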
Hello, some pages are built specifically to make it impossible to automate requests. That detection runs in JavaScript, and there are companies that specialize in spotting bots and closing off their access. So I'm sorry I cannot solve your problem; I tried to do the same as you and there was no way around it.

Simulate active session on a website with python

I'm looking for a way to simulate an active session on a website using Python. What I mean by that is I want to create software which makes the website think that an actual user with an actual browser has the website open. I've found urllib3 and its request.urlopen method, but it seems that this only reads the content provided at the URL and closes the connection. Thanks for any suggestions.
You can try to simulate browser requests to get the necessary cookies for authentication. Google Chrome DevTools and the Python requests library will do the job.
Some websites handle sessions another way, but I believe the majority use cookies set through POST requests.
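As a rough sketch of that idea (the login URL and form field names are placeholders you would copy from the Network tab in DevTools):

import requests

session = requests.Session()  # keeps cookies across requests, like a browser session

# placeholder login endpoint and form fields - copy the real ones from DevTools
login_url = "https://example.com/login"
credentials = {"username": "me@example.com", "password": "secret"}

resp = session.post(login_url, data=credentials)
resp.raise_for_status()

# later requests reuse the session cookies set by the login response
page = session.get("https://example.com/members-only")
print(page.status_code)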

Python scraping data that can only be accessed through Google OAuth login

I want to scrape some data from a website which uses Google OAuth for the authentication. Some data can only be accessed if I perform a login.
Basically, when you open the website (mamikos.com) and click login, there is no normal login form; it gives you the option to log in with Facebook or Google and then redirects you to the Google login page. After logging in with a Google account, you are redirected to the website's homepage and all the data is easily accessible with a simple click.
I am basically a noob, I only know some basic coding and googling. I have looked everywhere, but it seems like I am looking in the wrong place. I tried to write code with Selenium to automate the clicks, pass the username/password, and perform the login, but apparently Selenium is not the right tool for this, as it opens a browser and does everything there.
Is it possible to do this login and authentication process in the background? I have over a hundred thousand URLs of pages I need data from. Using Selenium would crash my computer and take a long time to finish.
Can someone here please show me, or at least point me to, the right tools/library/method? Or is it even possible?
Thanks
Please note that this answer is currently a work in progress. I'm working on (almost) the exact same problem (different site, and I'll be using Go), but I can provide a rough workaround for getting started, and when my solution matures I will update this.
Reiteration of problem statement
What you are asking for is a way for your scraper (third-party client) to authenticate with a website (Resource Server) via Google OAuth (Authorization Server), to access resources that your specific account (Resource Owner) has permission to see.
This sounds like three-legged OAuth.
"OAuth2 Simplified" is a nicely written article by Aaron Parecki giving a broad overview of the roles of client, resource owner, resource server, and authorization server in the three-legged OAuth process.
Another requirement (from what I can infer) is that the client you're implementing/authenticating with is not trusted by the Authorization Server or the Resource Server.
This is significant, as it precludes certain OAuth flows from being usable, and may mean that various OAuth client libraries/packages are not viable, as they may not implement flows for untrusted clients.
Workaround (Rough pass)
You identified selenium as a potential workaround for achieving authentication.
You accurately identified that selenium is not a great solution for large-scale scraping, as it is very heavyweight, relatively slow, & uses a lot of resources.
This being said, you only need to use Selenium once in this process: to automate the OAuth flow and obtain an access token for the website.
Once you get a token, you can discard your Selenium instance and use your favourite high-performance scraping library to carry out the rest of your tasks. From there, you can attach the token to your requests and receive access.
This blog post describes this approach broadly, using a JS selenium API (Under "Use a automated UI Test to get the access token via Authorization Code Grant" )
I will provide more specifics once I implement them.
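As a rough illustration of the "Selenium once, then a lightweight client" idea, assuming the site keeps you logged in with session cookies after the OAuth redirect (URLs are placeholders, and the manual pause stands in for automating the Google login):

import requests
from selenium import webdriver

# 1. do the interactive OAuth login once in a real browser
driver = webdriver.Chrome()
driver.get("https://mamikos.com")  # click through the Google login here, manually or via automation
input("Press Enter once you are logged in...")

# 2. copy the authenticated cookies out of the browser into a requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

# 3. scrape the remaining URLs with the lightweight session
urls = ["https://mamikos.com/listing-1", "https://mamikos.com/listing-2"]  # placeholders
for url in urls:
    resp = session.get(url)
    print(url, resp.status_code)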
I understand it can be tough to scrape data from websites that sit behind login pages. You will need to learn how to replicate the requests being sent to the server using Python's requests library. It can be daunting in the beginning, but you can learn it step by step here.

Some websites block selenium webdriver, how does this work?

So I'm trying to web crawl clothing websites to build a list of great deals/products to look out for, but I notice that some of the websites I try to load simply don't load. How are websites able to block Selenium WebDriver HTTP requests? Do they look at the headers or something? Can you give me a step-by-step of how Selenium WebDriver sends requests and how the server receives them and is able to block them?
Selenium uses a real web browser (typically Firefox or Chrome) to make its requests, so the website probably has no idea that you're using Selenium behind the scenes.
If the website is blocking you, it's probably because of your usage patterns (i.e. you're clogging up their web server by making 1000 requests every minute. That's rude. Don't do that!)
One exception would be if you're using Selenium in "headless" mode with the HtmlUnitDriver. The website can detect that.
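One commonly cited signal (not the only one, and not specific to HtmlUnitDriver) is the navigator.webdriver flag that browsers expose when they are being automated; a site's JavaScript can read it and decide to block the visitor. You can see the same flag from the Python side like this:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # the headless flag name varies by Chrome version
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# the same property a site's JavaScript can inspect
print(driver.execute_script("return navigator.webdriver"))
driver.quit()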
It's very likely that the website is blocking you due to your AWS IP.
Not only does that tell the website that somebody is likely scraping them programmatically, but most websites also limit the number of queries they will accept from any one IP address.
You most likely need a proxy service to pipe your requests through.
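For example, with the requests library you might route traffic through a proxy like this (the proxy address and credentials are placeholders; Scrapy and Selenium have their own equivalent settings):

import requests

# placeholder proxy endpoint - substitute your proxy provider's address and credentials
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

resp = requests.get("https://example.com", proxies=proxies, timeout=30)
print(resp.status_code)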

fake geolocation with scrapy crawler

I am trying to scrape a website which serves a different page depending on the geolocation of the IP sending the request. I am using an Amazon EC2 instance located in the US (which means it is served the page meant for the US), but I want the page that would be served in India. Does Scrapy provide a way to work around this somehow?
If the site you are scraping does IP based detection, your only option is going to be to change your IP somehow. This means either using a different server (I don't believe EC2 operates in India) or proxying your server requests. Perhaps you can find an Indian proxy service?
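If you do find a proxy with an Indian exit IP, Scrapy's built-in HttpProxyMiddleware will honour a per-request proxy set in the request meta. A minimal sketch, with the proxy address as a placeholder:

import scrapy

class GeoSpider(scrapy.Spider):
    name = "geo_spider"

    def start_requests(self):
        # route this request through a proxy located in India (placeholder address)
        yield scrapy.Request(
            "https://example.com",
            meta={"proxy": "http://user:pass@in-proxy.example.com:8080"},
            callback=self.parse,
        )

    def parse(self, response):
        self.log(f"Got {response.url} with status {response.status}")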
