Web scraping Access denied | Cloudflare to restrict access - python

I'm trying to access and get data from the www.cclonline.com website using a Python script. This is the code:
import requests
from requests_html import HTML
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
html = HTML(html=source.text)
print(source.status_code)
print(html.text)
These are the errors I get:
403
Access denied | www.cclonline.com used Cloudflare to restrict access
Please enable cookies.
Error 1020
Ray ID: 64c0c2f1ccb5d781 • 2021-05-08 06:51:46 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
How can I solve this problem? Thanks.

So the site's robots.txt does not explicitly say that no bot is allowed. But you need to make your request look like it's coming from an actual browser.
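As a first step, you can send browser-like headers with requests. This alone is often not enough against Cloudflare, but it is the baseline; the header values below are just an illustrative example:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/90.0.4430.93 Safari/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
}
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/', headers=headers)
print(source.status_code)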
Now to solve the issue at hand: the response says you need to have cookies enabled. That can be handled by driving a real browser with Selenium. Selenium has everything a browser has to offer (it basically drives Google Chrome, or a browser of your choice). It will make the server think the request is coming from an actual browser and will return a response.
Also remember to crawl politely: pause after each request and rotate user agents often, as in the sketch below.
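A rough sketch of that approach, assuming chromedriver is installed and on your PATH; the pause range is arbitrary:
import time
import random
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
urls = ['https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/']
for url in urls:
    driver.get(url)  # the real browser run handles Cloudflare's cookie and JavaScript checks
    html = driver.page_source  # parse this with requests_html, bs4, etc.
    time.sleep(random.uniform(3, 8))  # pause between requests, as advised above
driver.quit()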

There's no silver bullet for solving Cloudflare challenges. In my projects I've tried the solutions proposed on this page, using Playwright with different options: https://substack.thewebscraping.club/p/cloudflare-how-to-scrape

Related

How to scrape infinitely scrolling websites with login using Python requests (or similar)

I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using Selenium for this, but now I need to scrape a lot more pages, all at once. The problem is that Selenium is very resource-hungry, since I am running a full (headless) Chrome browser in each instance, and it is not stable at all (probably because of limited resources, but still). I know there is a way to look for the AJAX requests that the site uses and access them with the requests library, but I have two issues:
I can't seem to find the desired request.
The ones that I try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot; I am not a web developer).
Let me take Twitter as an example, since it is exactly what I am describing (except that it has an API). You have to log in, and then the feed loads infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
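For what it's worth, the AJAX-replication approach described in the question usually looks roughly like this; the endpoint, form fields, and pagination key below are hypothetical and have to be read from the browser's network tab for the real site:
import requests

session = requests.Session()
# hypothetical login endpoint and field names, copied from the network tab
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})
cursor = None
while True:
    # hypothetical paginated feed endpoint that the infinite scroll calls
    params = {'cursor': cursor} if cursor else {}
    page = session.get('https://example.com/api/feed', params=params).json()
    for item in page['items']:
        print(item)
    cursor = page.get('next_cursor')
    if not cursor:
        break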

How to log in to a website while web scraping

I am making a web scraper in Python that can bring back my YouTube channel stats, so I went to my YT Studio site, copied the link, and printed the soup using bs4. I took the whole text that was printed and created an HTML file, and when I looked at it, it was the YouTube login page.
So now I want to log in to this (let's say I can provide the password and email ID in a text file) in order to scrape the YT Studio stats. I have no idea about this (I'm new to web scraping).
You can use the YouTube API; you don't need web scraping for this task.
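For example, channel statistics come back from the YouTube Data API v3 channels endpoint. A minimal sketch, assuming you have created an API key in the Google Cloud console and know your channel ID (both placeholders below):
import requests

params = {
    'part': 'statistics',
    'id': 'YOUR_CHANNEL_ID',  # placeholder
    'key': 'YOUR_API_KEY',    # placeholder
}
r = requests.get('https://www.googleapis.com/youtube/v3/channels', params=params)
stats = r.json()['items'][0]['statistics']
print(stats['subscriberCount'], stats['viewCount'], stats['videoCount'])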
You can use the YouTube API to perform your operation. If you are still looking for a method to do it via web scraping, below is the code for it:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://accounts.google.com/signin')
# enter the email address and continue
driver.find_element(By.XPATH, '//*[@id="identifierId"]').send_keys('xxxxxxxx@gmail.com')
driver.find_element(By.XPATH, '//*[@id="identifierNext"]/div/button').click()
time.sleep(2)  # crude wait for the password page; an explicit WebDriverWait is more robust
# enter the password and continue; the exact selector may vary, so inspect the page
driver.find_element(By.XPATH, '//input[@type="password"]').send_keys('xxxxxxxx')
driver.find_element(By.ID, 'passwordNext').click()
While doing it via web scraping, after entering the email address and trying to enter the password, you may come across an error. It can happen for multiple reasons, such as two-factor auth being enabled or Google not treating the automated browser as secure.
You can disable two-factor auth for your login and give web scraping another try; that should help.
You likely log in via a POST request, so you'll want to use a browser and log in to YouTube while monitoring the network traffic. If you're using Firefox, that would be the Network Monitor in the developer tools; other browsers have an equivalent. You'll want to find the form request the page sends and then replicate it.
Although, if you're that new to web scraping, you might be better off starting with something easier or using YouTube's API.
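A rough sketch of replicating such a form request with requests; the URL and field names below are hypothetical and must be copied from the network monitor (and note that Google's real login flow is multi-step and much harder to replicate than this):
import requests

session = requests.Session()
# hypothetical endpoint and form fields, copied from the network monitor
resp = session.post('https://example.com/login', data={'email': 'me@example.com', 'password': 'secret'})
print(resp.status_code)
# the session now carries the auth cookies for subsequent requests
stats_page = session.get('https://example.com/account/stats')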

Logging on to site to web scrape in Python

I want to scrape data from a website which has an initial log on (where I have working credentials). It is not possible to inspect the code for this, as it is a log on that pops up before visiting the site. I tried searching around but did not find any answer; perhaps I do not know what to search for.
This is what you get when going to the site:
[Screenshot: a browser "Log on" credentials dialog]
Any help is appreciated :-)
The solution is to use the public REST API for the site.
If the web site does not provide a REST API for interacting with it, you should not be surprised that your attempt at simulating a human is difficult. Web scraping is generally only possible for pages that do not require authentication, or that use the standard HTTP 401 status response to tell the client it should prompt the user for credentials. If the site is using a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site (or other sites using the same mechanism). That means no one can answer your question, since you did not tell us which web site you are interacting with.
Based on your screenshot this looks like it is just using Basic Auth.
Using the requests library:
import requests
session = requests.Session()
# HTTPBasicAuth matches the Basic Auth dialog; auth=('user', 'pass') is equivalent shorthand
r = session.get(url, auth=requests.auth.HTTPBasicAuth('user', 'pass'))
Should get you there.
I couldn't get Tom's answer to work, but I found a workaround that passes the Basic Auth credentials in the URL itself:
from selenium import webdriver
driver = webdriver.Chrome('path to chromedriver')
driver.get('https://user:password#webaddress.com/')
This worked :)

Possible to get blocked from scraping a site?

I am writing code in Python using Selenium and Beautiful Soup to scrape Upwork for job listings and descriptions. I now keep getting an error that says:
"Access to this web page is denied. Please make sure your browser
supports JavaScript and cookies and that you are not blocking them
from loading. To learn more about how Upwork uses cookies please
review our Cookie Policy."
Do they not want people to scrape their sites?
You may have to:
clear the cache and cookies
disable all ad blockers
For more information, check Upwork's own help pages.
Upwork has an official API and a Python lib as well, so they might not be that keen on you scraping the website.
You can find the documentation here.
https://developers.upwork.com/?lang=python#
There is a jobs endpoint that does what you want.
https://developers.upwork.com/?lang=python#jobs

How do I make Python urllib2 cleverly avoid the security check while trying to log into a site?

I am trying to crawl a website for the first time, using Python's urllib2.
I am currently trying to log into the Foursquare social networking site using urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How do I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API; I guess there is no other way. APIs are designed for such purposes.
Crawlers depending solely on the HTML format of the page will fail in the future when the HTML changes.
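For illustration, a call to the Foursquare v2 API's venue search endpoint with urllib2 might look like this; treat the parameters as assumptions and check the API documentation for the exact contract:
import json
import urllib
import urllib2

params = urllib.urlencode({
    'client_id': 'YOUR_CLIENT_ID',          # placeholder credentials
    'client_secret': 'YOUR_CLIENT_SECRET',  # placeholder credentials
    'v': '20120609',                        # API version date
    'll': '40.7484,-73.9857',               # latitude,longitude
})
response = urllib2.urlopen('https://api.foursquare.com/v2/venues/search?' + params)
data = json.load(response)
print(data['response']['venues'][0]['name'])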
