Scrapy - Page Access Denied - python

I am looking for help scraping www.mobile.de; instead of the page, my spider gets an "Access Denied" page.
A regular spider results in the page shown in the attached picture.
So far I have tried or observed the following:
I am not blocked, since I can open the page in Firefox and Chrome
I allowed cookies
I used the same headers that Firefox currently sends
I set a Referer header
I tried with "Obey robots.txt" both enabled and disabled
I used Splash to render JavaScript
So right now I cannot tell how the page detects that my program is a bot, or how to avoid that.
https://ibb.co/7RsMkM3

Related

Source code is not complete because "JS is disabled in your browser"

I'm writing Python code to first get the full source code of a web page so that I can scrape it later. But when I fetch the source, I see the aforementioned message ("If you're seeing this message, that means JavaScript has been disabled on your browser, please enable JS to make this app work") along with only partial HTML. When I press F12 to look at 'Elements', the entire code appears, whereas pressing Ctrl + U to view the source yields the same result as the Python script below:
import requests
from bs4 import BeautifulSoup
source = requests.get(link).text  # link: URL of the page to fetch
soup = BeautifulSoup(source, 'lxml').prettify()
I've seen similar questions, but none of them had a satisfactory solution. For example, it was recommended to use Selenium to open the page in a browser and work with that, but that would take additional time. JS is enabled in my browser.
It is as you have seen in the other answers: you have to use Selenium (or another browser automation tool) to get JavaScript rendering. The web page you are trying to access uses client-side rendering, which means that the first thing it sends when you access the URL is a bunch of JavaScript code. The browser then executes that JavaScript to build the DOM of the web page.
You say that JavaScript is enabled in your browser, but that has nothing to do with your Python code. The requests library sends an HTTP GET request to the server to fetch the web page, and the server replies, as it would to any other request, with the JavaScript that knows how to render the page. requests never executes that JavaScript. That's why you need something like Selenium, which runs a real browser instead of doing a simple HTTP request.
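To illustrate the point, here is a minimal sketch using only the standard library. SHELL_HTML is a made-up example of what a client-side-rendered site might send (it is not the real page from the question), and VisibleText is a hypothetical helper name. It shows that a plain HTTP client only ever sees the fallback message, because the script is never run:

```python
from html.parser import HTMLParser

# Hypothetical response body of a client-side-rendered page (an assumption,
# not the actual site): a script tag, a <noscript> fallback and an empty root.
SHELL_HTML = """<!DOCTYPE html>
<html>
  <head><script src="/static/app.js"></script></head>
  <body>
    <noscript>Please enable JS to make this app work</noscript>
    <div id="root"></div>
  </body>
</html>"""


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


parser = VisibleText()
parser.feed(SHELL_HTML)
visible_text = parser.chunks
print(visible_text)  # only the <noscript> fallback; <div id="root"> is empty
```

With Selenium, the rendered markup is available from driver.page_source after driver.get(link), and that string can be passed to BeautifulSoup in place of requests.get(link).text.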

Web scraping Access denied | Cloudflare to restrict access

I'm trying to access and get data from the www.cclonline.com website using a Python script.
This is the code:
import requests
from requests_html import HTML
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
html = HTML(html=source.text)
print(source.status_code)
print(html.text)
This is the output I get:
403
Access denied | www.cclonline.com used Cloudflare to restrict access
Please enable cookies.
Error 1020
Ray ID: 64c0c2f1ccb5d781 • 2021-05-08 06:51:46 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
How can I solve this problem? Thanks.
So the site's robots.txt does not explicitly say that bots are disallowed. But you need to make your request look like it's coming from an actual browser.
Now to solve the issue at hand. The response says you need to have cookies enabled. That can be handled by driving a real browser with an automation tool like Selenium. Selenium gives you everything a browser has to offer (it drives Google Chrome or another browser of your choice). The server then sees a request coming from an actual browser and is more likely to return a normal response.
Learn more about how to use selenium for scraping here.
Also remember to adjust the crawl rate accordingly: pause after each request and swap user agents often.
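A rough sketch of the pausing and user-agent swapping, assuming the standard library's urllib instead of requests so that it has no dependencies. The user-agent strings, the 2 to 5 second delay range, and the names build_request, polite_delay and fetch_all are all placeholders of mine, and none of this is guaranteed to get past Cloudflare:

```python
import itertools
import random
import time
import urllib.request

# Example user-agent strings for illustration only; substitute current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]
_agents = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Return a Request carrying the next user agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_agents)})


def polite_delay(min_seconds=2.0, max_seconds=5.0):
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def fetch_all(urls):
    """Fetch each URL in turn, sleeping between requests (does network I/O)."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
        time.sleep(polite_delay())
    return pages
```

Calling fetch_all with a list of URLs is what actually hits the network; build_request and polite_delay can be reused with any other HTTP client.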
There's no silver bullet for solving Cloudflare challenges. In my projects I've tried the solutions proposed on this website, using Playwright with different options: https://substack.thewebscraping.club/p/cloudflare-how-to-scrape

How to make selenium-browser and manually opened browser have same behaviour?

I'm having a problem logging in to target.com with Selenium. I have used the Firefox and Chromium webdrivers. Login always succeeds in a browser that I open manually, but it always fails with Selenium.
The error happens when I submit the login form. It says "error T83072242".
I have attached the AJAX response that I get here.
After some analysis, I concluded that these variables in the request header are what cause the error. When I replace them with the ones from another browser (one that I opened manually), the AJAX request succeeds.
So, how do I make the Selenium browser behave like a normal browser?
Pardon my English.

Scrapy shell doesn't crawl web page

I am trying to use Scrapy shell to figure out the selectors for zone-h.org. I ran scrapy shell 'webpage' and then tried to view the content to be sure that it was downloaded, but all I can see is a dash icon (-). It doesn't download the page. I opened the website in a browser to check whether my connection to it is somehow blocked, but it was reachable. I tried setting the user agent to something more generic, like Chrome, but no luck there either. The website is blocking me somehow, but I don't know how to bypass it. I dug through the website to see whether they forbid crawling, and it doesn't say that crawling is forbidden. Can anyone help out?
There is a cookie issue with your spider: if you send your cookies with your request, you will get your desired data.
You can see that in attached picture.
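One way to do that, sketched under assumptions: copy the Cookie request header from the browser's developer tools, turn it into a dict, and pass it to Scrapy through the cookies argument of scrapy.Request. The helper name cookie_header_to_dict and the cookie names and values below are made up for illustration; the real ones come from your own browser session:

```python
def cookie_header_to_dict(header):
    """Convert a raw Cookie header such as 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Placeholder header; paste the one your browser sends to the site.
browser_cookies = cookie_header_to_dict("session_id=abc123; consent=yes")
print(browser_cookies)

# Inside a Scrapy spider (not executed here):
#     yield scrapy.Request(url, cookies=browser_cookies, callback=self.parse)
```

Cookies set this way can expire, so they may need to be copied again from the browser after a while.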
Can you use scrapy shell "webpage" on another webpage that you know works/doesn't block scraping?
Have you tried using the view(response) command to open up what scrapy sees in a web browser?
When you go to the webpage using a normal browser, are you redirected to another, final homepage?
- if so, try using the final homepage's URL in your scrapy shell command
Do you have a firewall that could prevent a Python/command-line app from connecting to the internet?

Scrapy for dynamic content

Can we use Scrapy to get content from a web page that is loaded by JavaScript?
I'm trying to scrape usage examples from this page,
but since they are loaded by JavaScript as a JSON object, I'm not able to get them with Scrapy.
Could you suggest the best way to deal with such issues?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:
After removing the JSONP parameter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
Requesting this JSON endpoint directly keeps the number of requests to a minimum, so your spider will be fast.
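As a sketch of how a spider could generate such URLs, assuming that startOffset simply advances in steps of maxResults (I have only seen the single request above, so treat the paging scheme as a guess). The function names examples_url and page_urls are mine:

```python
from urllib.parse import urlencode

API_URL = "https://corpus.vocabulary.com/api/1.0/examples.json"


def examples_url(query, start_offset=0, max_results=24):
    """Build the JSON endpoint URL for one page of usage examples."""
    params = {
        "query": query,
        "maxResults": max_results,
        "startOffset": start_offset,
        "filter": 0,
    }
    return f"{API_URL}?{urlencode(params)}"


def page_urls(query, pages, max_results=24):
    """URLs for the first `pages` pages, assuming offsets step by max_results."""
    return [
        examples_url(query, page * max_results, max_results)
        for page in range(pages)
    ]


print(examples_url("unalienable", start_offset=24))
```

Each of these URLs can be yielded as an ordinary scrapy.Request, and the response body parsed with json.loads in the callback.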
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).
