I'm trying to scrape some web page, but failed

I'm trying to scrape Zone-H to get some information. I found that the website executes a JavaScript challenge (called SlowAES).
The website seems to detect that I am using Selenium ChromeDriver, because z.js fails to download and I can't connect to the website.
I've added these two experimental options:
chrome_options.add_experimental_option('excludeSwitches',["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension',False)
But I still cannot get it to scrape.

Related

Cloudflare protection error 503 - "checking your browser"

I made a script to scrape data from a webpage that is protected by Cloudflare. I was scraping around 25k links from this website and the script was working fine. I have extracted all the links and now want to scrape information from each of them. The script used to work, but since a recent security update on the website I get error 503 from the requests library and a "checking your browser" page from Selenium. Is there any way to bypass it?
I also have a Scraper API subscription for making requests through proxies, and I use the "scraper_api" library for that.
Here are some of the links that need to be scraped but produce these errors:
https://coinatmradar.com/bitcoin_atm/31285/bitcoin-atm-general-bytes-birmingham-altadena-spirits/
https://coinatmradar.com/bitcoin_atm/23676/bitcoin-atm-general-bytes-birmingham-marathon-gas/
I have already tried other approaches such as cfscraper, cloud scraper, and undetected chromedriver, but with no luck.
Please try scraping any other link on the site and share any solution. Thanks.

How do I retrieve JSON from web page without a dedicated .json URL?

I am building a web scraper for this website https://www.kucoin.com/news and I want to be able to pick up the latest news. For other websites, I would simply fetch the JSON, parse it in Python, and monitor it constantly for changes. For this web page, however, there is no dedicated JSON URL that I could find in the Network tab in Google Chrome. Apparently, the JSON is embedded in the page itself (when I view the source, it is at the bottom). How can I extract the JSON so I can set up a scraper / monitor for it?
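
One possible approach, sketched under assumptions: if the JSON sits inside an inline script tag, the script text can be collected and decoded from its first "{". The sample page, the `window.__STATE__` variable, and the keys below are made up for illustration and are not taken from the real site; `extract_embedded_json` is a hypothetical helper. Only the standard library is used.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_embedded_json(html):
    """Return the JSON value that starts at the first '{' of each inline script."""
    collector = ScriptCollector()
    collector.feed(html)
    decoder = json.JSONDecoder()
    found = []
    for text in collector.scripts:
        start = text.find("{")
        if start == -1:
            continue
        try:
            # raw_decode parses one JSON value and ignores trailing text like ';'
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript, not JSON data
        found.append(obj)
    return found


# Hypothetical page: variable name and keys are assumptions for illustration.
SAMPLE_HTML = """
<html><body>
<h1>News</h1>
<script>function track() { return 1; }</script>
<script>window.__STATE__ = {"news": [{"id": 1, "title": "Listing update"}]};</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_embedded_json(SAMPLE_HTML))
```

To monitor for changes, the decoded object from each fetch could be compared with the previous one. This sketch only handles scripts whose first "{" begins the JSON; a real page may need a more specific search.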

How to approach web-scraping in python

I am new to Python and have just started on web scraping. I have to scrape data from this realtor site.
I need to scrape the details of all real-estate agents, grouped by their real-estate agency.
In the web browser I have to follow these steps:
Go to this site.
Click the agency offices button, enter the pin code 4000 in the search box, and submit.
This gives a list of the agencies.
Go to the "our team" tab of each agency to see its agents there.
Go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use Selenium to interact with the pages?
I have worked with requests, BeautifulSoup, and simple form submission using mechanize.

For a search site like this I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work; the drawback is that it will be slow. With Selenium you can use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the webpage and parse the data with BeautifulSoup.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open a modern web browser (Firefox, Chrome) and use its network tools (usually located in the developer tools, or via right-click and "Inspect Element"). While the network tab is recording, interact with the webpage to see the requests made to the server. A search may, for example, request suggestions, e.g.
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will then probably be JSON containing the suggested results. Once you select a suggestion, you can submit a request with those search parameters, e.g.
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page. You then just need to send a separate request to each agent's page and extract the information using BeautifulSoup.
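
A minimal sketch of the two parsing steps above, using only the standard library (`html.parser` stands in for BeautifulSoup here, so it runs without extra installs). The hosts are the placeholder URLs from this answer, and the `/agent/` path pattern, the sample HTML, and the helper names are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Placeholder hosts from the answer above, not real endpoints.
SUGGEST_URL = "https://suggest.example.com.au/smart-suggest"
SITE_ROOT = "https://www.example.com.au"


def build_suggest_url(query, n=7):
    """Rebuild the suggestion request seen in the browser's network tab."""
    params = {"query": query, "n": n, "regions": "false"}
    return SUGGEST_URL + "?" + urlencode(params)


class AgentLinkParser(HTMLParser):
    """Collect absolute URLs of links whose href contains '/agent/'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: agent profile pages live under an '/agent/' path.
        if href and "/agent/" in href:
            self.links.append(urljoin(self.base_url, href))


def extract_agent_links(html, base_url=SITE_ROOT):
    """Return the agent page URLs found in a results page."""
    parser = AgentLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical results page used only to exercise the parser.
SAMPLE_RESULTS = """
<ul>
  <li><a href="/agent/jane-doe-123">Jane Doe</a></li>
  <li><a href="/agent/john-roe-456">John Roe</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

if __name__ == "__main__":
    print(build_suggest_url("4000"))
    for url in extract_agent_links(SAMPLE_RESULTS):
        print(url)
```

With Requests, each of these URLs would be fetched through one shared `requests.Session()` so that cookies set by the search page carry over to the agent pages.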

You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: “Using JQuery & NodeJS to scrape the web” #asimmittal https://medium.com/#asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape Yelp.

How to extract request url using python selenium

I am new to web scraping and need some help extracting a request URL from the online movie-streaming website YIFY. I am familiar with how Selenium works, and I am trying to find the download URL of the movie Revenant.
Using python-selenium I can click on the play icon. If you open Inspect Element and go to the network tab, you can see the request URL, but you can't find it by inspecting the page elements.
Download link - http://download1282.mediafire.com/3pvv1jx9z23g/crdad7bg0ghjh7r/vid.pdf
I am trying to extract this particular download link using python-selenium. Could anyone tell me if it is possible? I am not trying to download the movie, only checking whether it is possible to extract the links. The links are not embedded in the HTML page. I would highly appreciate any help.

Browser Simulation and Scraping with windmill or selenium, how many http requests?

I want to use Windmill or Selenium to simulate a browser that visits a website, scrapes the content and, after analyzing the content, takes some action depending on the analysis.
As an example: the browser visits a website where we can find, say, 50 links. While the browser is still running, a Python script can analyze the links found and decide which link the browser should click.
My big question is how many HTTP requests this takes with Windmill or Selenium. Can these two programs simulate visiting a website in a browser and scrape the content with just one HTTP request, or would they send another internal request to the website to get the links while the browser is still running?
Thanks a lot!

Selenium drives a real browser, so the number of HTTP requests is not one. There will be multiple HTTP requests to the server for the JS, CSS, and images (if any) referenced in the HTML document.
If you want to scrape the page with a single HTTP request, you need to use a scraper that fetches only the HTML source. If you are using Python, check out BeautifulSoup for parsing that source.
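
To make the difference concrete, here is a rough sketch that lists the extra resources a browser would request after the HTML itself. It uses the standard library's `html.parser`; the sample page and the `list_subresources` helper are hypothetical. A real browser may fetch more (fonts, XHR calls made by scripts), so treat the count as a lower bound.

```python
from html.parser import HTMLParser


class SubresourceParser(HTMLParser):
    """Record scripts, stylesheets and images referenced by an HTML page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(("js", attrs["src"]))
        elif tag == "img" and attrs.get("src"):
            self.resources.append(("image", attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(("css", attrs["href"]))


def list_subresources(html):
    """Return (kind, url) pairs for each extra request a browser would make."""
    parser = SubresourceParser()
    parser.feed(html)
    return parser.resources


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body>
<img src="/static/logo.png" alt="logo">
<a href="/page/1">First link</a>
</body></html>
"""

if __name__ == "__main__":
    extra = list_subresources(SAMPLE_PAGE)
    print("HTML-only scraper: 1 request")
    print("Browser: at least", 1 + len(extra), "requests")
```

The plain link in the sample is not counted, because a browser only requests it when it is clicked.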
