Web scraping using VBA requires showing all nodes - python

I am unable to scrape the complete data from a table on a website with around 1000 entries using VBA.
As seen in the image, the code needs to show all nodes before the complete data can be scraped. Is there any way to do that?

Related

Python Financial Chart Scraping

Right now I'm trying to scrape the dividend yield from a chart using the following code.
df = pd.read_html('https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history')
df = df[0].dropna()
But the code won't pick up the chart's data.
Any suggestions on pulling it from the website?
Here is the specific link I'm trying to use: https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history
I've used the same code to pick up the book values, but the objects used for the dividends and the book values must be different.
Maybe I could use Beautiful Soup?
Sadly that website is rendered dynamically, so there's nothing in the HTML pandas receives to scrape from: the chart's data is fetched after the page loads. Scraping the static HTML isn't going to help you here, because the data simply isn't there.
You can either find an API which provides the data (best, and quite possible given the content), work out where the page fetches its data from and see if you can hit that endpoint directly (better, if possible), or use something like Selenium to control a real browser, render the page, get the HTML, and then parse that.
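As a minimal sketch of the Selenium route, assuming Chrome is available and that the table appears in the DOM once the page's scripts have run (the fixed 5-second wait is a crude placeholder; a WebDriverWait on a known element would be more robust):

from io import StringIO
import time

import pandas as pd
from selenium import webdriver

# Render the page in a real browser so the JS-built table actually exists in the DOM.
driver = webdriver.Chrome()
driver.get('https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history')
time.sleep(5)  # crude wait for the chart/table scripts to finish

# Hand the rendered HTML to pandas instead of the raw (empty) server response.
tables = pd.read_html(StringIO(driver.page_source))
driver.quit()

df = tables[0].dropna()
print(df.head())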

Web scraping from dynamic websites in Python and Selenium

I asked a question yesterday and got an answer from @QHarr that dynamic websites like Workday (take https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn for example) generate job posts' links by making extra XHR requests. So, if I want to extract specific job post links, normal webpage scraping with an HTML parser or CSS selectors keyed on keywords is not feasible, and the links cannot be extracted from the HTML source code generated by the Selenium driver either. (Based on WeiZhang2017's GitHub post: https://gist.github.com/Weizhang2017/0029b2ff59e943ca9f024c117fbdf88a)
In my case, since websites like Workday use Ajax to load data as needed, I used Selenium to simulate scrolling down the page to load more data. However, as for getting the JSON response through Selenium, I searched a lot but couldn't find an answer that fits my need.
My plan to extract the specific job posts' links had three steps in general:
Use Selenium to load and scroll down the website
Use a method similar to requests' .get().json() to fetch the scrolled-down website's JSON response data
Search through the JSON response data with my specific keywords to get the specific posts' links.
However, here come my questions.
Step 1: I did this with a loop that scrolls down as many pages as I want. No problem.
scroll = 3
while scroll:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    scroll = scroll - 1
Step 2: I don't know what kind of method would work here; I searched a lot but couldn't find an easy-to-understand answer. (I am new to Python and Selenium, with a limited understanding of scraping dynamic websites.)
Step 3: I think I could handle the search and get what I want (the specific job posts' links) once I have the JSON data (assume it is named log) as shown in Chrome under Inspect → Network → Preview.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink']
         for x in log['body']['children'][0]['children'][0]['listItems']
         if x['instance'][0]['text'] == mySpecificWords]
I'd appreciate any thoughts on a solution for step 2.
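One way to approach step 2 is to skip Selenium for the JSON part entirely: open Chrome's Network tab, scroll the page, copy the URL and payload of the XHR that returns the job list, then replay it with requests. A minimal sketch, where the endpoint path and payload below are illustrative guesses to be replaced with whatever DevTools actually shows, not a documented API:

import requests

# Hypothetical endpoint copied from the DevTools Network tab while scrolling.
url = 'https://wd1.myworkdaysite.com/wday/cxs/upenn/careers-at-penn/jobs'

# Assumed paging parameters; mirror whatever the real XHR sends.
payload = {'limit': 20, 'offset': 0, 'searchText': ''}

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
log = resp.json()  # the same structure you saw under Inspect → Network → Preview

# Walk the JSON instead of the rendered HTML; adjust the keys to match
# what the Preview tab shows for your tenant.
for item in log['body']['children'][0]['children'][0]['listItems']:
    print(item['title']['commonlink'])

Each offset value is one "page" of results, so incrementing offset in a loop replaces the scrolling from step 1.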

Get update alerts for many webpages. Systematic automated web scraping

I have used the Google Sheets function IMPORTXML to scrape specific parts of webpages, but it doesn't work properly with long XPaths and isn't fluent or smooth across tons of website URLs.
I have also tried the Distill extension and Excel's web-table scraping, but neither is a smooth long-term solution.
Please help me get notified of changes/updates to specific parts of tons of webpages.
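A minimal sketch of one systematic approach, assuming the pages are static HTML and you know an XPath for each part you care about (the URL and XPath below are placeholders): fetch each page on a schedule, hash just the extracted fragment, and alert when the hash changes.

import hashlib

import requests
from lxml import html

# Placeholder page/XPath pairs; replace with your own watchlist.
WATCHLIST = {
    'https://example.com/page': '//div[@id="price"]',
}

def fragment_hash(url, xpath):
    """Fetch a page and hash only the part selected by the XPath."""
    tree = html.fromstring(requests.get(url, timeout=30).content)
    fragment = tree.xpath(f'string({xpath})')
    return hashlib.sha256(fragment.encode()).hexdigest()

# Compare against hashes saved on the previous run (persist these to a file or DB).
previous = {}
for url, xpath in WATCHLIST.items():
    digest = fragment_hash(url, xpath)
    if previous.get(url) not in (None, digest):
        print(f'Changed: {url}')
    previous[url] = digest

Run it from cron or a scheduler and swap the print for an email or chat notification.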

Web scraping of my Kibana server

I am running the ELK stack for log analysis, in which Kibana is used for data visualization. Now I want to extract some fields from the Kibana webpage.
I want to extract the CU and count fields; I have attached a screenshot of the webpage and the corresponding HTML source code.
I have tried to scrape the same webpage using Python and the Beautiful Soup library, but the HTML my code sees is different from what the browser shows.
Please help. Also, can you suggest some other method by which I can extract the required fields?
It's better to make a direct request to your Elasticsearch for the data you need.
You can see the query executed by a visualization if you go to Dashboard, click the arrow in the bottom left corner, and select the Request tab:
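A minimal sketch of replaying such a query outside Kibana, assuming Elasticsearch is reachable on its default port; the index pattern, field name, and aggregation below are placeholders to be replaced with what the Request tab actually shows:

import requests

# Default local Elasticsearch endpoint; adjust host and index to your cluster.
url = 'http://localhost:9200/logstash-*/_search'

# Placeholder terms aggregation; paste the real body from Kibana's Request tab.
query = {
    'size': 0,
    'aggs': {
        'by_cu': {'terms': {'field': 'cu.keyword'}}
    }
}

resp = requests.post(url, json=query, timeout=30)
resp.raise_for_status()

# Each bucket carries the field value ('key') and its document count.
for bucket in resp.json()['aggregations']['by_cu']['buckets']:
    print(bucket['key'], bucket['doc_count'])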

Scrape PDFs inside viewer frame

(Complete beginner in web scraping here)
I'm trying to scrape the PDF from this webpage using python:
http://pesquisa.in.gov.br/imprensa/jsp/visualiza/index.jsp?jornal=3&pagina=1&data=31/03/1993
The problem is that the above URL points to the viewer (with date and page parameters), not to the PDF file itself. I tried inspecting the HTML to find the direct URL of the PDF, but could not.
Any help on how to find the correct URL and implement a way to download the PDFs in Python?
Edit:
I will later generalize this to other days and pages; the full list of day-page links can be found by searching for the relevant period here: http://portal.imprensanacional.gov.br/
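A minimal sketch of one way to hunt for the real file, assuming the viewer embeds the PDF through an iframe, embed, or object tag (if it doesn't, watch the browser's DevTools Network tab while the viewer loads and look for a response with an application/pdf content type):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

viewer_url = ('http://pesquisa.in.gov.br/imprensa/jsp/visualiza/index.jsp'
              '?jornal=3&pagina=1&data=31/03/1993')

soup = BeautifulSoup(requests.get(viewer_url, timeout=30).content, 'html.parser')

# Collect the source attributes of the tags viewers typically embed documents with.
candidates = [tag.get('src') or tag.get('data')
              for tag in soup.find_all(['iframe', 'embed', 'object'])]

for src in filter(None, candidates):
    pdf_url = urljoin(viewer_url, src)  # resolve relative links against the viewer
    resp = requests.get(pdf_url, timeout=60)
    # Only save responses that are actually PDFs.
    if resp.headers.get('Content-Type', '').startswith('application/pdf'):
        with open('page.pdf', 'wb') as f:
            f.write(resp.content)
        print('saved', pdf_url)

Once the real URL pattern is known, looping over the jornal, pagina, and data parameters generalizes this to other days and pages.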
