Python web scraping but blocked

I'm trying to do web scraping with the BeautifulSoup and requests libraries, but I got blocked by the website.
Instead of copying and pasting from the website by hand, I wanted to do it automatically, so I tried it with Python.
I just did:
import requests
from bs4 import BeautifulSoup

page = requests.get(url)                           # url is the page I wanted to scrape
soup = BeautifulSoup(page.content, 'html.parser')  # parse the returned HTML
results = soup.find(class_='list-xxx')             # pick out the element I care about
I was trying to understand the HTML, and when I went back to the website, I was blocked.
How come? I did not send thousands of requests.
Does this mean we can't do web scraping?
Thanks

This can happen for many reasons. It's possible that you are in a country the website does not serve, or that you violated their terms by sending too many requests.
In such cases, you can take the behavior you describe as an indication that the owners of the website either do not want you to scrape their information, or that they interpreted the frequency of your requests as an attempted DDoS (Distributed Denial of Service) attack.
If they do not want to allow scraping, it's advisable not to do it. If they do not have a problem with scraping, it's a good idea to contact them and ask about their policy (if it isn't public already) so you can comply with it and, where scraping is allowed, scrape in a way that doesn't upset them.
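As a rough illustration of what "scraping in a way that doesn't upset them" can look like in practice, here is a minimal sketch: check robots.txt, identify yourself with a User-Agent, and pause between requests. The URL, User-Agent string, and contact address are placeholders, not values from the question.
import time
from urllib import robotparser

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/some/listing'   # placeholder for the page being scraped

# Ask robots.txt whether crawlers are allowed to fetch this path.
rp = robotparser.RobotFileParser('https://example.com/robots.txt')
rp.read()

headers = {
    # Identify yourself instead of using the default 'python-requests/x.y' agent;
    # the name and contact address are made up for this sketch.
    'User-Agent': 'my-little-scraper/0.1 (contact: you@example.com)'
}

if rp.can_fetch(headers['User-Agent'], url):
    page = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(class_='list-xxx')
    time.sleep(2)  # wait between requests so the server isn't hammered
else:
    print('robots.txt disallows fetching this URL')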

Related

Is there a way to scrape URLs without scraping links?

Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to get past the JavaScript redirects. While I've been able to use requests & BeautifulSoup to scrape the page it redirects to for 'href' values, without circumventing the JS I cannot pull the links out of the news article itself.
import requests
from bs4 import BeautifulSoup

url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a'):      # every anchor on the (redirected) page
    urls.append(link.get('href'))    # keep the href as well as printing it
    print(link.get('href'))
I've added allow_redirects=False to every request, to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or the type of file (.jpg)? I'm open to suggestions here, I really have no idea what direction to come at this thing from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?
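For what it's worth, a minimal sketch of the brute-force idea described in the question, assuming the missing part of each filename can be enumerated from a candidate list; the candidate suffixes and the prefix + '.jpg' pattern are assumptions, not something the site is known to use.
import requests

prefix = 'https://www.he-man.org/assets/images/home_news/justinedantzer_'
candidates = ['01', '02', '03']   # placeholder guesses for the unknown part of the name

for guess in candidates:
    candidate_url = f'{prefix}{guess}.jpg'
    # Request the URL directly; a 200 with an image content type suggests the file exists.
    resp = requests.get(candidate_url, allow_redirects=False, timeout=10)
    if resp.status_code == 200 and resp.headers.get('Content-Type', '').startswith('image/'):
        print('found:', candidate_url)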

Python scraping dynamic table

I tried with several different attempts to scrape the following page:
https://www.finanzen.ch/rohstoffe/historisch/weizenpreis/euro/17.4.2022_17.5.2022
Somehow, I'm not successful with either the requests or the Selenium approach.
Does anybody have an idea how to scrape the data from the historical data table?
Thanks for your hints.
ThinkerBell
You can't bypass this website using a simple requests.get; Selenium/Splash and even rotating proxies won't always work either. This is because the website uses captcha services and knows how you are trying to access the page. The request headers contain "Content-Disposition: form-data; name='recaptcha-token';" with a long encoded token, and since this token is based on your browsing activity, copy-pasting it into your own headers won't work either.
For such tricky websites, the best option is to use browser-based add-ons like iMacros. You may also improve your chances with Selenium if you start by browsing the homepage and loading a few other dummy links before reaching the targeted link.
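A rough sketch of that "browse a few pages first" idea with Selenium; the intermediate URL and wait times are placeholders, and there is no guarantee this gets past the captcha.
import time
from selenium import webdriver

driver = webdriver.Chrome()   # assumes chromedriver is available on PATH

# Visit the homepage and another page first, roughly like a human would,
# before navigating to the page with the historical data table.
driver.get('https://www.finanzen.ch/')
time.sleep(5)
driver.get('https://www.finanzen.ch/rohstoffe/weizenpreis')   # placeholder intermediate page
time.sleep(5)
driver.get('https://www.finanzen.ch/rohstoffe/historisch/weizenpreis/euro/17.4.2022_17.5.2022')
time.sleep(5)

html = driver.page_source    # rendered HTML, including the table if it loaded
driver.quit()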

Pages not processing fully

I am trying to scrape news articles from Yahoo Finance, and to do so I want to use their sitemap page https://finance.yahoo.com/sitemap/
The problem I have is that after following a link, for example https://finance.yahoo.com/sitemap/2015_04_02, Scrapy does not process the whole page, only the header, so I cannot access the links to the different articles.
Are there some internal requests that I have to send to the page?
I still get the whole page when deactivating JavaScript in my browser, and I use Scrapy 1.6.
Thanks.
Some sites take defensive measures against robots scraping their websites. If they detect that you are not human, they may not serve the entire page. But more likely, a lot of client-side rendering happens when you view the page in a web browser, and that rendering is not executed when you request the same page in Scrapy.
Yahoo! Finance has an API. Using that will probably get you more reliable results.
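One quick way to test the client-side-rendering explanation is to look at what the raw response actually contains, without Scrapy or a browser in the way. A minimal sketch; the User-Agent and the idea of simply counting anchors are just a diagnostic, not a fix.
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/sitemap/2015_04_02'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
soup = BeautifulSoup(resp.text, 'html.parser')

links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(len(links), 'links found in the raw HTML')
# If only header/navigation links show up here, the article links are most
# likely injected by JavaScript after the initial response.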

Parsing bot protected site

I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract its most recent messages from its board. It is bot protected with Cloudflare. I am using Python and its related libraries, and this is what I have so far
from bs4 import BeautifulSoup as soup  # parses/cuts the html
import cfscrape
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'
r = requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div", {"id": "bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser because the site then detects and blocks my script.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) will inform the server about its identity.
Generally speaking, a browser will say I am a browser and a library will say I am a library. The server can then say I allow browsers but not libraries to access my content.
However, for this particular case, you can simply lie to the server by sending your own User Agent header.
You can see an example here. Try to use your browser's user agent.
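A minimal sketch of overriding the User-Agent with requests; the browser string below is only an example, so copy the one your own browser actually sends.
import requests
from bs4 import BeautifulSoup

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'
headers = {
    # Example desktop-browser User-Agent; replace it with your own browser's string.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
}
r = requests.get(url, headers=headers)
html = BeautifulSoup(r.text, 'html.parser')
containers = html.find('div', {'id': 'bbPosts'})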
Other blocking techniques include IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
Another possibility is that you are trying to access a single-page application that is not rendered server-side. In that case, what you receive from that GET request is a very small HTML file that essentially just references a JavaScript file. If that is the case, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, though there are others. You can also use Selenium.
Web crawling is a beautiful but very deep subject. I think these pointers should set you off in the right direction.
Also, as a quick mention, my advice is to avoid from bs4 import BeautifulSoup as soup. I would recommend html2text instead.

Webscraping Financial Data from Morningstar

I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying just IBM, but I hope to eventually be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, os, bs4, string
url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class":"r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing at the moment is that, unlike with a simpler webpage I have scraped before, the program can't seem to locate any tables, even though I can see them in the HTML of the page.
In researching this problem the closest stackoverflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with and somehow generated a different web link which managed to scrape the data, but I don't understand where it came from.
It's a real problem scraping some modern web pages, particularly on pages generated by single-page applications (where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
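A rough sketch of that Selenium approach for the Morningstar page; headless Chrome is just one choice, the selectors come from the question's own code, and the fixed sleep is a crude stand-in for a proper wait.
import time

import bs4
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')            # run the browser without a window
driver = webdriver.Chrome(options=options)    # assumes chromedriver is installed

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
driver.get(url)
time.sleep(5)   # crude wait so the page's scripts have time to build the tables

soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

summary = soup.find('div', {'class': 'r_bodywrap'})
tables = summary.find_all('table') if summary else []
print(len(tables), 'tables found')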
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should that search come up empty, a usually fruitful approach is to investigate which AJAX calls the page makes to retrieve its data and then issue them directly. This can be done via the browser's debugger, in the "Network" tab or similar, where each request can be inspected in detail in a very friendly UI.
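Once such a request shows up in the Network tab, you can usually replay it directly. A minimal sketch; the endpoint URL below is purely a placeholder for whatever XHR URL you find in the debugger, not a documented Morningstar API.
import requests

# Placeholder: paste the request URL copied from the browser's Network tab here.
ajax_url = 'http://financials.morningstar.com/...some-endpoint...?t=IBM'

resp = requests.get(ajax_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)

# Depending on the endpoint, the payload is typically JSON or CSV-like text.
if 'json' in resp.headers.get('Content-Type', ''):
    data = resp.json()
else:
    data = resp.text
print(type(data))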
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.
