Scrape a website with interactive buttons - python

I am totally new to web scraping.
I am trying to download the tables from https://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7.html
On the website you select a year and press "Go"; the table for the selected year is then displayed, and I want to save that table.
I guess there should be a way to simulate the human actions: for example, automatically select 1900, press "Go", and then loop 100 times to record the tables from 1900 to 2000. But I don't know how to simulate these actions.
I already know how to download the table once it is displayed; I just don't know how to get the table displayed.
Thanks!

https://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7_1950.html
https://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7_2030.html
As you can see, the only thing that changes is the year. So to scrape the site, request "https://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7_" + TheYearIWant + ".html" for each year you want.
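Since only the year changes, the loop the question asks for can be a plain loop over URLs, with no button clicks to simulate. A minimal sketch using only the standard library; the function names and output file names are illustrative, and it assumes that a year without a page answers with an HTTP error, so such years are skipped:

```python
import urllib.error
import urllib.request

BASE = "https://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7_{year}.html"


def table_url(year):
    """Build the page URL for one year (pattern taken from the two URLs above)."""
    return BASE.format(year=year)


def download_tables(start, end, step=1):
    """Save each year's page as LifeTables_<year>.html; return the years saved."""
    saved = []
    for year in range(start, end + 1, step):
        try:
            with urllib.request.urlopen(table_url(year)) as resp:
                html = resp.read()
        except urllib.error.HTTPError:
            # Assumption: a year that has no page answers with an HTTP error.
            continue
        with open(f"LifeTables_{year}.html", "wb") as fh:
            fh.write(html)
        saved.append(year)
    return saved


# Usage: download_tables(1900, 2000)
```

Each saved file can then be parsed with whatever you already use to extract a single table.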

Related

How to scrape data from website with calendar

I'm trying to get a list of all NBA games and the referees from each game.
This website (https://official.nba.com/referee-assignments/) has the referee information, but you have to click the "Date" button in the top-right and can only look at one date at a time. The data goes all the way from 12-02-2015 through today.
I'm a complete newbie to web scraping. So far I've put together some Python code (using Selenium) that scrapes the information I want, but I can only figure out how to do it for one day. Is there any way to automate it so it scrapes the info for all dates from 12-02-2015 through today?

Python - How to scrape a table from a website with a dropdown of available rows

I am trying to scrape the earnings calendar data from the table on zacks.com; the URL is below.
https://www.zacks.com/stock/research/aapl/earnings-calendar
I want to scrape all the data in the table, but it has a dropdown list to show 10, 25, 50 or 100 rows per page. Ideally I want all 100 rows, but when I select 100 from the dropdown list the URL doesn't change. My code is below.
Note that the website blocks requests based on the user agent, so I had to use ChromeDriver to imitate a human visiting the page. pd.read_html returns a list of all the tables on the page, and d[4] is the earnings calendar with only 10 rows (which I want to change to 100).
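from selenium import webdriver
import pandas as pd

# Start Chrome through a local chromedriver binary
driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
# Parse every <table> in the rendered page into a list of DataFrames
content = driver.page_source
d = pd.read_html(content)
d[4]  # the earnings calendar table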
Could anyone guide me on this?
Thanks!
UPDATE: It looks like my last post was downvoted for lack of clear articulation and of evidence of past research. Maybe I am still new to posting questions on this site. Actually, I have found several pages, including this page, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.
UPDATE 12/05:
Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:
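import time
from selenium.webdriver.common.by import By
# Locate the rows-per-page dropdown
dropdown = driver.find_element(By.CSS_SELECTOR, '#earnings_announcements_earnings_table_length')
time.sleep(1)
# Pick the option whose text is exactly '100'
hundreds = dropdown.find_element(By.XPATH, ".//option[. = '100']")
hundreds.click()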
Having taken a look, this is not going to be easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.
Option one:
Use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it.
You can then scrape the data by reading the values in the table.
Option two:
This is the more challenging one. Look through the responses the page receives (for example in the browser's network tab) and try to find the requests that return the data you then see on the page. By cross-referencing these you should find a way to request the data you want directly.
You may find that to get at the data you need to take a key from the response to the original page request and send that key as part of a second request. This approach lets you scrape the data without running a Selenium instance, so it will run more efficiently.
My personal suggestion is to go with option one as computer resources are cheap and developer time expensive.
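For option one, once Selenium has rendered the page and the dropdown has been changed, driver.page_source is ordinary HTML and the table values can be read from it. A minimal sketch of that last step using only the standard library (pd.read_html from the question does the same job); the class and function names are illustrative and not part of any library:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside the <table> with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []
        self._depth = 0     # > 0 while inside the target table
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self._depth or dict(attrs).get("id") == self.table_id:
                self._depth += 1
        elif self._depth and tag == "tr":
            self._row = []
        elif self._depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(page_source, table_id):
    """Return the rows of the table with id `table_id` as lists of strings."""
    parser = TableRows(table_id)
    parser.feed(page_source)
    return parser.rows
```

With Selenium this would be called as table_rows(driver.page_source, "earnings_announcements_earnings_table") after selecting 100 rows; that table id is a guess based on the "..._length" selector in the question's update, so check it in the page inspector first.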

How to scrape data from pop-ups (I need to scrape data that is only visible once I click the popup, which is not a link)

I'm an absolute beginner in Python.
I need to scrape data from this website, which is a directory of professors.
Some of the data is visible without clicking (names, schools, etc.).
However, I need to scrape the email and department info as well.
I've been searching the internet for the whole day and I don't know how to do it.
Could anyone please help?
When you check the network activity, you'll see that the data is loaded dynamically from a Google spreadsheet. You can retrieve the spreadsheet directly without scraping.
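If the sheet is shared publicly, one way to fetch it is the Google Sheets CSV export URL. A sketch under that assumption; the spreadsheet id and gid have to be copied from the request you see in the network tab, and the id in the usage comment is a placeholder:

```python
import csv
import io
import urllib.request

EXPORT = "https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"


def sheet_csv_url(sheet_id, gid=0):
    """Build the CSV export URL for one tab of a spreadsheet."""
    return EXPORT.format(sheet_id=sheet_id, gid=gid)


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_sheet(sheet_id, gid=0):
    """Download one tab and return its rows (the sheet must be publicly readable)."""
    with urllib.request.urlopen(sheet_csv_url(sheet_id, gid)) as resp:
        return parse_csv(resp.read().decode("utf-8"))


# Usage (placeholder id): rows = fetch_sheet("SPREADSHEET_ID")
```

Each row comes back as a dict, so fields such as email or department can be read by column name, whatever the header row of the actual sheet calls them.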

Can't scrape data from webpage with popup/frame

I am having trouble finding elements on a customer-facing webpage that I am scraping data from, using Robot Framework + Selenium. My trouble, I think, has to do with the desired data existing in a popup/frame. The data I seek is located on a customer's invoice, which pops up when I press a button ("View Current Invoice"). I've been successful with logging into the site and navigating around, and at one point I was successful pressing the View Current Invoice button to cause the invoice to pop up - but forgot to commit that code and lost it. :-(
In any case, even if I manually open the invoice popup by pressing the button when my script expects it to be pressed, I can't seem to scrape the subsequent data. I have tried to identify elements on the invoice using locators (from the Right-Click-Inspect capability built into Firefox and Chrome; Katalon Recorder; Selenium IDE; etc.). I get what looks like a valid locator (almost always XPath); yet when I run my Robot script, it fails to find the element in question. I have spent a lot of time poring over the page's source code, but since I am not as savvy with HTML/JS/CSS as I should be, I haven't been successful.
Here is a screenshot of the invoice button:
And here is what I see when the button is pressed. I want to scrape all the invoice data, like Amount Due, Invoice Number, Due Date, etc.
Does anyone have any idea what I am missing here? What would you do to get the data on the invoice if you were in my shoes? I know my question probably sounds vague and naive, but I am at the end of my rope, so to speak. I am willing to share page source code, more screenshots, whatever is required.
EDIT I used Rahul Rai's method to inspect the popup while it was popped up; then searched for "iframe". There were 10 matches; #7, when clicked on, resulted in the invoice popup being highlighted in blue:
I assume this means this is the iframe referencing the popup? If so, I should be able to find information about the "handle" to the iframe in the inspection code, but I don't see anything there that matches the locators I am used to (e.g. name, id, xpath). I even tried "Select Frame 1599252503952", but that just resulted in an "Element with locator '1599252503952' not found" error.
From the screenshot you have shared, I can see that your invoice details are inside an iframe. So after clicking the View Current Invoice button, you can use the code below to switch into the frame and then scrape the required information.
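from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the invoice iframe and switch into it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[contains(@src,'invoice_detail_container')]")))
# Code to scrape data
ele = driver.find_element(By.XPATH, '<xpath>')
print(ele.text)
......
......
# After your work in this frame is done, navigate back to the main window
driver.switch_to.default_content()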
Note: I have assumed that the invoice's frame is not inside any other iframe (based on the screenshot shared), and that there is no further nested frame before the invoice elements start. If there is a nested frame, you need to switch into that one first.
I was finally able to scrape data from the Invoice popup after inspecting the HTML source, and seeing this:
<iframe frameborder="0" src="/cmc/invoice_detail_container.pyt?direction=//my.hughesnet.com/cmc/invoice_detail.pyt%3Finvnumber%1234-567890&portletId=863" name="1599391562960" class="cboxIframe" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
I was then able to use the Select Frame keyword in Robot Framework, specifying the iframe locator for the popup, using the 'class' strategy. I also had to explicitly select the main body frame first. In the end, the code that allowed me to enter the iframe and scrape was:
Select Frame    body
Select Frame    class:cboxIframe
Big thank you to Rahul Rai for pushing me closer to the solution; and thanks to the others who answered as well.
You need to switch the driver's context to the frame/popup. Something like the example below may help.
IList<IWebElement> textfields = driver.FindElements(By.TagName("iframe"));
driver.SwitchTo().Frame(textfields[count]); // count: index of the target iframe in the list
Please adapt this to your scenario, and let me know if you have any questions.
You can try:
driver.switch_to.active_element
and then scrape the popup to close it. Then I think it will be okay...

Python Web Scraping with submission of Select and Radio Button

I am interested in writing a Python script that visits the URL "https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action#district", clicks on the tab "Search by property type and postal district", and finally chooses the parameters:
Select the Date of Lease Commencement: "Sep-14" to "Aug-2017"
Select the Property type (radio buttons): "Landed properties"
Select the Postal District (I want to iterate and select all, but there is a limit of 5 at a time, hence ideally there should be a loop that selects 5 each time)
before submitting "Search".
After pressing Search, the Python script should click on "Download into CSV".
The main problem I am facing is that I am not sure how to click on tabs or get Python to fill in the "Select" and "Radio" buttons.
Any help is greatly appreciated.
Thanks!
