Get update alerts for many webpages: systematic, automated web scraping - Python

I have used the Google Sheets function IMPORTXML to scrape specific parts of webpages, but it does not work reliably with long XPath expressions and does not scale smoothly across a large number of URLs.
I have also tried the Distill browser extension and Excel's web-table import, but neither is a smooth long-term solution.
Please help me get notified when specific parts of a large number of webpages change or update.
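One plain-Python approach is to poll each URL on a schedule, hash the text of the watched fragment, and alert when the hash changes. Below is a minimal sketch of that idea; the watchlist URL and XPath are hypothetical placeholders, and it assumes the requests and lxml packages are installed.
import hashlib
import json
import pathlib
import requests
from lxml import html

# hypothetical watchlist: URL -> XPath of the fragment to watch
WATCHLIST = {
    "https://example.com/page": "//div[@id='prices']",
}
STATE_FILE = pathlib.Path("state.json")

def fingerprint(url, xpath):
    # fetch the page and hash the text of the watched fragment
    tree = html.fromstring(requests.get(url, timeout=30).content)
    text = " ".join(node.text_content().strip() for node in tree.xpath(xpath))
    return hashlib.sha256(text.encode()).hexdigest()

state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
for url, xpath in WATCHLIST.items():
    new = fingerprint(url, xpath)
    if url in state and state[url] != new:
        print(f"CHANGED: {url}")  # plug in an email/Slack notification here
    state[url] = new
STATE_FILE.write_text(json.dumps(state))
Run it from cron or Task Scheduler and replace the print with whatever notification channel you prefer.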

Related

Python scraping dynamic table

I have made several different attempts to scrape the following page:
https://www.finanzen.ch/rohstoffe/historisch/weizenpreis/euro/17.4.2022_17.5.2022
Somehow, neither the requests nor the Selenium approach has been successful.
Does anybody have an idea how to scrape the historical data table?
Thanks for your hints.
You can't bypass this website using a simple requests.get; Selenium/Splash and even rotating proxies won't always work either. That is because the website uses captcha services and knows how you are trying to access the page. The request headers contain "Content-Disposition: form-data; name='recaptcha-token';" followed by a long encoded token, and since that token is based on your browsing activity, copy-pasting it into your headers won't work.
For such tricky websites, the best option is to use browser-based add-ons like iMacros. You may also improve your chances with Selenium if you start by browsing the homepage and loading a few more dummy links before reaching the targeted page, as in the sketch below.
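A rough illustration of that warm-up idea (no guarantee against captcha services); it assumes chromedriver is available, and the intermediate URL is a guess:
import random
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.finanzen.ch/")  # start at the homepage
time.sleep(random.uniform(2, 5))        # pause like a human reader would
driver.get("https://www.finanzen.ch/rohstoffe/")  # an intermediate page (assumed path)
time.sleep(random.uniform(2, 5))
driver.get("https://www.finanzen.ch/rohstoffe/historisch/weizenpreis/euro/17.4.2022_17.5.2022")
print(driver.page_source[:1000])        # inspect whether the table actually rendered
driver.quit()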

Selenium won't load specific webpages on bet365

I am trying to scrape basketball odds from bet365, but I am finding that certain leagues' pages won't load, even when I just load the page and don't automate anything else. Which leagues load and which don't is very irregular, but for a large portion of them the site displays the message "Sorry, this page is no longer available. Betting has closed or has been suspended." I am fairly new to web scraping and am not using any tools to hide the fact that this is an automated script, so I don't know whether the site is flagging bot activity; but if that were the case, I don't see why some pages would still load.
I am simply using chromedriver and driver.get() with links like:
bet365.com/#/AC/B18/C20814074/D48/E1453/F10
bet365.com/#/AC/B18/C20604387/D48/E1453/F10/
Both of those work; however, other links:
bet365.com/#/AC/B18/C20815826/D48/E1453/F10/
bet365.com/#/AC/B18/C20816919/D48/E1453/F10/
don't work, despite being near-identical pages.
I am just looking for some insight into why certain leagues would be blocked, and whether there is any way to work around it.
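For reference, the setup described above boils down to something like this minimal script (assuming chromedriver is on PATH; the league IDs are the ones from the question and may have expired):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bet365.com/#/AC/B18/C20815826/D48/E1453/F10/")
driver.implicitly_wait(10)  # give the single-page app time to render
# on failing leagues this should contain the "no longer available" message
print(driver.find_element(By.TAG_NAME, "body").text[:500])
driver.quit()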

How to scrape data from pop-ups (data that is only visible once I click the pop-up, which is not a link)

I'm an absolute beginner in Python.
I need to scrape data from this website, which is a directory of professors.
Some of the data (names, school, etc.) is visible without clicking,
but I need to scrape the email and department info as well.
I've been searching the internet for the whole day and I don't know how to do it.
Could anyone please help?
When you check the network activity, you'll see that the data is dynamically loaded from Google Spreadsheets. You can retrieve the spreadsheet directly without scraping.
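A minimal sketch of that, assuming the sheet is public; SHEET_ID is a hypothetical placeholder for the ID visible in the spreadsheet request in the network tab:
import csv
import io
import requests

SHEET_ID = "YOUR_SHEET_ID"  # hypothetical: copy it from the request in the network tab
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
for row in csv.reader(io.StringIO(resp.text)):
    print(row)  # each row carries name, email, department, etc.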

Web Scraping Financial Data from Morningstar

I am trying to scrape data from the Morningstar page below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying just IBM, but I hope eventually to be able to type in another company's ticker and do the same with it. My code so far is below:
import requests
import bs4

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'

# fetch the page and parse the returned HTML
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, "html.parser")

# locate the ratios wrapper and collect every table inside it
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing is that, unlike on simpler webpages I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.
The closest Stack Overflow question I found while researching this problem is:
Python webscraping - NoneObeject Failure - broken HTML?
There they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with to somehow generate a different web link which managed to scrape the data, but I don't understand where it came from.
It's a real problem scraping some modern web pages, particularly pages generated by single-page applications (where the content is built by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
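For example, a sketch of the Selenium approach for this Morningstar page; the wait condition is an assumption about when the ratios tables have been injected:
import bs4
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

url = "http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US"
driver = webdriver.Chrome()
driver.get(url)
# wait until the page's scripts have injected at least one table into the ratios wrapper
WebDriverWait(driver, 20).until(
    lambda d: d.execute_script("return document.querySelectorAll('.r_bodywrap table').length") > 0
)
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
tables = soup.find("div", {"class": "r_bodywrap"}).find_all("table")
print(tables[0])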
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
If that search turns up nothing, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve its data and then issue them directly. You can do this with the browser's developer tools, in the "Network" tab, where each request can be inspected in detail in a friendly UI.
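For example, the key-ratios CSV export that the linked gist relies on can be requested directly; note this is an unofficial endpoint and may have changed or been retired since:
import requests

ticker = "IBM"
# unofficial export endpoint used by the gist; may no longer work
url = f"http://financials.morningstar.com/ajax/exportKR2CSV.html?t={ticker}"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.text[:500])  # key-ratios data as CSV, one row per metric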
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for Node.js/PhantomJS: ScraperJS. It is very easy to use: it injects jQuery into the scraped page, and you can extract data with jQuery selectors.

How to search for specific links (which may be in a PDF file) on a website and crawl those links for other information?

I have a task to complete: I need to build a web-crawler kind of application. I pass a URL to my application; this URL is the website of a government agency, and it contains links to the individual agencies approved by that government agency. I need to follow those links and get some information from each site about that agency. I hope I am making myself clear. I also have to make this application generic, meaning I can't hard-code it for just one website (government agency); given any URL, it should check it, get all the links, and proceed. On some websites these links are in PDFs, and on some they are on a page.
I have to use Python for this, and I don't know how to approach it. I spent time on this using BeautifulSoup, but that requires lots of parsing. Other options are Scrapy or twill. Honestly, I am new to Python and don't know which one is better for this task, so can anyone help me select the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
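To get you started, here is a rough sketch that collects links from an HTML page with BeautifulSoup and from PDFs with pypdf. The seed URL is a placeholder, and it naively assumes every discovered link is fetchable and that PDF links are stored as URI annotations:
import io
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

def links_from_html(url):
    # return absolute URLs of all anchors on an HTML page
    soup = BeautifulSoup(requests.get(url, timeout=30).content, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def links_from_pdf(url):
    # return URLs stored as link (URI) annotations in a PDF
    reader = PdfReader(io.BytesIO(requests.get(url, timeout=30).content))
    found = []
    for page in reader.pages:
        for annot in page.get("/Annots") or []:
            action = annot.get_object().get("/A", {})
            if "/URI" in action:
                found.append(action["/URI"])
    return found

seed = "https://example.gov/agencies"  # placeholder for the government agency page
for link in links_from_html(seed):
    extractor = links_from_pdf if link.lower().endswith(".pdf") else links_from_html
    print(link, "->", extractor(link)[:5])
From there you would add per-agency extraction logic, politeness delays, and deduplication; Scrapy gives you most of that plumbing for free once the crawl grows.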
