Possible to get blocked from scraping a site? - python

I am writing code in Python using Selenium and Beautiful Soup to scrape Upwork for job listings and descriptions. I now keep getting an error that says:
"Access to this web page is denied. Please make sure your browser
supports JavaScript and cookies and that you are not blocking them
from loading. To learn more about how Upwork uses cookies please
review our Cookie Policy."
Do they not want people to scrape their site?

You may have to do the following:
clear the cache and cookies
disable all ad blockers
For more information, check Upwork.

Upwork has an official API and a Python library as well, so they might not be that keen on you scraping the website.
You can find the documentation here.
https://developers.upwork.com/?lang=python#
There is a jobs endpoint that does what you want.
https://developers.upwork.com/?lang=python#jobs

Related

Login with authenticated session with Scrapy

I am writing a web scraping project in Python using Scrapy. For reference, the website I'm planning to scrape is https://umass.moonami.com/.
The problem is the login phase. Normally, when I log in using a browser, it redirects me to https://login.microsoftonline.com/ (sending a SAML request). In Scrapy, however, I can only reach https://webauth.umass.edu/idp/profile/SAML2/Redirect/SSO?execution=e1s1.
Can anyone help me figure out why that is? Thank you very much.
Logging in with Scrapy or similar libraries is often very hard, because they do not execute JavaScript and many login flows depend on it. (I'm not sure about this case.)
So I suggest using a browser automation tool instead. There are two well-known frameworks for this purpose:
Puppeteer (my recommendation), which is a Node.js library: https://github.com/puppeteer/puppeteer
Selenium: https://selenium-python.readthedocs.io/
They will make your job much easier, but they consume far more resources.

Cloudflare protection error 503 - "checking your browser"

I made a script to scrape data from a website that is protected by Cloudflare. I was scraping around 25k links from this website and the script was working fine. I have extracted all the links and now want to scrape information from each of them. The script used to work, but since a recent security update on the website I get error 503 from the requests library and a "checking your browser" page from Selenium. Is there any way to bypass it?
I also have a Scraper API subscription for making requests through proxies, and I am using the "scraper_api" library for that.
Here are some of the links that need to be scraped but produce these errors:
https://coinatmradar.com/bitcoin_atm/31285/bitcoin-atm-general-bytes-birmingham-altadena-spirits/
https://coinatmradar.com/bitcoin_atm/23676/bitcoin-atm-general-bytes-birmingham-marathon-gas/
I have already tried other approaches like cfscraper, cloud scraper, and undetected chromedriver, but no luck.
Please try to scrape any of the other links as well and share any solution. Thanks

Web scraping Access denied | Cloudflare to restrict access

I'm trying to access and get data from the www.cclonline.com website using a Python script.
This is the code:
import requests
from requests_html import HTML
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
html = HTML(html=source.text)
print(source.status_code)
print(html.text)
These are the errors I get:
403
Access denied | www.cclonline.com used Cloudflare to restrict access
Please enable cookies.
Error 1020
Ray ID: 64c0c2f1ccb5d781 • 2021-05-08 06:51:46 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
How can I solve this problem? Thanks.
The site's robots.txt does not explicitly say that bots are disallowed, but you need to make your request look like it's coming from an actual browser.
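If you want to check a robots.txt programmatically rather than by eye, the standard library's urllib.robotparser can do it. This is only a sketch: the robots.txt content and the "my-scraper" agent name below are made up so the example runs offline, and they are not taken from the site in the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not the real file of any site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a hypothetical user agent name.
print(parser.can_fetch("my-scraper", "https://example.com/category/409/"))  # True
print(parser.can_fetch("my-scraper", "https://example.com/checkout/cart"))  # False
print(parser.crawl_delay("my-scraper"))  # 5
```

For a live site you would call set_url() with the robots.txt address and then read() instead of parse(). Keep in mind that robots.txt only states the site's wishes; a Cloudflare rule can still block you even when robots.txt allows the path.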
Now to solve the issue at hand. The response says you need to have cookies enabled. That can be handled by driving a real browser with Selenium, optionally in headless mode. Selenium has everything a browser has to offer (it uses Google Chrome or another browser of your choice as a driver). The server then sees a request coming from an actual browser and is more likely to return a normal response.
Learn more about how to use Selenium for scraping here.
Also remember to adjust the crawl rate accordingly: pause after each request and rotate user agents often.
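A rough sketch of the pausing and user-agent rotation advice, using only the standard library. The User-Agent strings, the delay range, and the helper names (build_requests, fetch_bytes, crawl) are my own assumptions for illustration; none of this is guaranteed to get past Cloudflare.

```python
import itertools
import random
import time
import urllib.request

# Example User-Agent strings (assumptions for illustration only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_requests(urls, user_agents=USER_AGENTS):
    """Yield one Request per URL, cycling through the User-Agent list."""
    agents = itertools.cycle(user_agents)
    for url in urls:
        yield urllib.request.Request(url, headers={"User-Agent": next(agents)})


def fetch_bytes(request):
    """Download the response body for a prepared Request."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def crawl(urls, min_delay=2.0, max_delay=5.0, fetch=fetch_bytes, sleep=time.sleep):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    pages = []
    for index, request in enumerate(build_requests(urls)):
        if index > 0:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(request))
    return pages
```

You would call it as crawl(list_of_urls). The fetch and sleep parameters exist only so the loop can be exercised without hitting the network.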
There's no silver bullet for solving Cloudflare challenges. In my projects I've tried the solutions proposed on this website, using Playwright with different options: https://substack.thewebscraping.club/p/cloudflare-how-to-scrape

Logging on to site to web scrape in Python

I want to scrape data from a website which has an initial log on (for which I have working credentials). It is not possible to inspect the code for this, as it is a log on that pops up before visiting the site. I tried searching around but did not find any answer; perhaps I do not know what to search for.
This is what you get when going to the site:
Log on
Any help is appreciated :-)
The solution is to use the public REST API for the site.
If the web site does not provide a REST API for interacting with it, you should not be surprised that simulating a human is difficult. Web scraping is generally only straightforward for pages that either do not require authentication or use the standard HTTP 401 status response to tell the client to prompt the user for credentials. If the site uses a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site or to other sites using the same mechanism. That means no one can answer your question, since you did not tell us which web site you are interacting with.
Based on your screenshot this looks like it is just using Basic Auth.
Using the library "requests":
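import requests
from requests.auth import HTTPBasicAuth
url = 'https://example.com/'  # placeholder: address of the protected page
session = requests.Session()
r = session.get(url, auth=HTTPBasicAuth('user', 'pass'))  # Basic auth, matching the pop-up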
Should get you there.
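For anyone wondering what HTTP Basic authentication actually sends: it is just an Authorization header containing the base64 of "user:password". Here is a small standard-library sketch; the function names and the placeholder credentials are mine, not part of any library.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_authenticated_request(url: str, username: str, password: str):
    """Prepare a request that carries the credentials up front."""
    return urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Because base64 is an encoding and not encryption, this should only be sent over HTTPS.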
I couldn't get Tom's answer to work but I found a workaround:
from selenium import webdriver
driver = webdriver.Chrome('path to chromedriver')
driver.get('https://user:password@webaddress.com/')
This worked :)

Scraping Ajax - Using python

I'm trying to scrape a YouTube page with Python that has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
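As an illustration of the urllib plus ElementTree approach, here is a minimal sketch. The XML below is a made-up sample, not the real format of any YouTube response (the current YouTube Data API returns JSON), and the element names are assumptions.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Made-up sample feed so the example runs offline (an assumption).
SAMPLE_XML = """\
<feed>
  <entry><title>First video</title><views>120</views></entry>
  <entry><title>Second video</title><views>45</views></entry>
</feed>
"""


def fetch_xml(url: str) -> str:
    """Download an XML document and return it as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_entries(xml_text: str):
    """Return (title, views) pairs for every <entry> element."""
    root = ET.fromstring(xml_text)
    return [
        (entry.findtext("title"), int(entry.findtext("views")))
        for entry in root.iter("entry")
    ]


print(parse_entries(SAMPLE_XML))  # [('First video', 120), ('Second video', 45)]
```

With a real endpoint you would pass the result of fetch_xml(url) to parse_entries and adjust the element names to whatever the API returns.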
The main problem is that you're violating the TOS (terms of service) for the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it. Technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, turn on the Net panel in Firebug, and click on the desired link on YouTube. Now see what happens and which pages are requested. Find the ones that are responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for and you just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info, or other header fields such as HTTP_REFERER. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsibly" scraping! :)
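Once the browser's network panel has shown you the AJAX URL, replaying it can look roughly like this. The endpoint, query parameters, and header values are placeholders I made up; copy the real ones from the request you observed.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_request(endpoint: str, params: dict, referer: str):
    """Build a GET request that mimics the headers a page's own AJAX call sends."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={
            "Referer": referer,                    # some servers check this
            "X-Requested-With": "XMLHttpRequest",  # commonly sent by AJAX calls
            "Accept": "application/json",
        },
    )


def fetch_json(request):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
request = build_ajax_request(
    "https://example.com/ajax/comments",
    {"page": 2, "id": "abc"},
    referer="https://example.com/watch",
)
print(request.full_url)  # https://example.com/ajax/comments?page=2&id=abc
```

If fetch_json(request) returns the same data the page shows, you can skip the browser entirely; if not, the server is probably also checking cookies or session state.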
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites. Note that Scrapy does not execute JavaScript itself, so for AJAX content you either request the underlying endpoints directly or combine it with a JavaScript-capable tool.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
