Scraping text that unhides on button click with BeautifulSoup and Python?

So I am trying to scrape the following URL:
https://en.indonetwork.co.id/company/surrama-groups
The page has some hidden text that only unlocks after a click. The corresponding HTML is also hidden and only appears after the button is clicked. (Screenshots of the markup before and after the click omitted.)
How can I scrape this text?
BeautifulSoup doesn't pick it up.

If you open dev tools and click those buttons, you can see that the page makes a POST request to https://en.indonetwork.co.id/ajax.
So you can either try to replicate that: see if you can capture the payload sent in the POST request from a scrape of the page and send it yourself.
Or you could use Selenium to load the page, click the button, and then capture the data.

It is not working with BeautifulSoup because this is not a static site: when you click the phone button, the page sends a request to an API endpoint and then renders the response from that request. You can confirm this in the Network tab of dev tools (I did).
BeautifulSoup only parses the initial static HTML returned by your request; it does not account for requests triggered by user interaction.
The solution to this is Selenium.
Here are the exact steps you can follow to get this done (a sketch follows the list):
1- Load Selenium with a headful browser (a headful browser lets you watch and debug the interaction easily).
2- Find the phone button and click it.
3- Wait until the request has been processed and the result has been rendered on screen.
4- Grab the content of the element as per your requirement.
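A minimal Selenium sketch of those steps, assuming the button class from the answer below and assuming the revealed phone number replaces the button text (verify both in dev tools):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # headful browser, so you can watch the click happen
driver.get('https://en.indonetwork.co.id/company/surrama-groups')

# The selector comes from the requests-based answer below (assumption).
button = driver.find_element(By.CSS_SELECTOR, '.btn-contact-phone')
button.click()

time.sleep(3)  # crude wait for the AJAX response to render

# Re-read the element; the revealed text is assumed to land here.
print(driver.find_element(By.CSS_SELECTOR, '.btn-contact-phone').text)
driver.quit()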
A less good solution
You could send the request directly to that exact same API endpoint, but you may have some security barriers, like CORS, to get over.
This is not a good solution because the API endpoint might change, and since this API call returns a phone number, the site may lock it down further in the future. The interaction on the web page, by contrast, stays nearly the same.

You don't need click simulation; there is an AJAX call happening under the hood which you can replicate directly:
import requests
import re
from bs4 import BeautifulSoup

# Fetch the page and locate the contact button; its data-* attributes
# carry the payload for the AJAX call.
soup = BeautifulSoup(
    requests.get('https://en.indonetwork.co.id/company/surrama-groups').text,
    'html.parser',
)
v = soup.find(class_='btn btn-contact btn-contact-phone btn-contact-ajax').attrs

# Replay the POST request that the button click would trigger.
data = requests.post('https://en.indonetwork.co.id/ajax', json={
    'id': v['data-id'],
    'text': v['data-text'],
    'type': v['data-type'],
    'url': 'leads/ajax',
}).json()

# The phone numbers are embedded in the returned text.
mobile_no = re.findall(r'(\d+)', data['text'])
print(mobile_no)  # ['622122520556', '6287885720277']

Related

Cannot select HTML element with BeautifulSoup

Novice web scraper here:
I am trying to scrape the name and address from this website: https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the code below, which only returns None (or an empty array if I replace find() with find_all()). I would like it to return the HTML of this particular section so I can extract the text and later add it to a CSV file. If the link doesn't work, or doesn't take you to where I'm working, simply go to the Knox County TN website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
# html is assumed to hold the page source (e.g. browser.html from splinter)
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There are a few issues I see, but it could be that you didn't include your code as you actually have it.
Splinter works on its own to get page data by letting you control a browser; you don't need BeautifulSoup or requests alongside it. You use requests when you want the raw response without running any of the things that browsers do for you automatically.
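A minimal splinter-only sketch under that assumption (the 'Agree' button locator is a guess you would confirm on the disclaimer page):

from splinter import Browser

browser = Browser('chrome')  # a real browser: redirects and cookies are handled for you
browser.visit('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')

# You land on the disclaimer page first; accept it (locator is an assumption).
browser.find_by_value('Agree').first.click()

html = browser.html  # rendered page source, ready for browser.find_by_css or parsing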
One of the things browsers handle automatically is redirects. The link you provided does not itself serve the HTML you are seeing: it returns a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop.
On this page you have to hit the 'agree' button to get redirected back to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the DISCLAIMER value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study a page and understand what's going on to do this with just the requests and BeautifulSoup libraries. Besides the redirects I mentioned, you still have to figure out which network request gives you that session id so you can manually add it to the Cookie header you send on all future requests. You can skip some requests this way, so it is a lot faster, but you do need to be able to follow along in the developer tools Network tab.
Postman is a good tool to help you set up requests yourself and see their results. Then you can bring all the setup from there into your code.
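A hedged sketch of the requests-only flow described above (the shortcut of setting DISCLAIMER=1 directly is an assumption you would verify against the real form submission):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps ASP.NET_SessionId across requests

# Hitting the search URL follows the redirect chain down to the
# disclaimer page and stores the session cookie along the way.
session.get('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')

# Shortcut from above: tell the server we agreed to the terms.
session.cookies.set('DISCLAIMER', '1', domain='propertyinfo.knoxcountytn.gov')

# Now the datalet page should render for this session.
resp = session.get('https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1')
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.find('td', class_='DataletData'))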

send a post request to a website with multiple form tags using requests in python

Good evening,
I'm trying to write a program that extracts the sell price of certain stocks and shares on a website called hl.co.uk.
As you can imagine, you have to search for the stock you want to see the sale price of.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')

# Pick out the stock-search form and the names of its inputs.
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a request to this website using the data I have in post_data, aimed at that specific form, so that it searches for the stock/share I want and returns the next page?
Thanks in advance.
Actually, when you submit the form from the homepage, it redirects you to the target page with a URL looking like this: https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW. So in my opinion, instead of submitting the homepage form, you should call the target page directly with your own GET parameters; the URL you're supposed to call will look like https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords].
Hope this helps.
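A short sketch of that approach (the keyword "fgt" is taken from the question; treating the extra x/y/category_list parameters as optional is an assumption):

import requests

# Call the results page directly with the search keywords as a GET parameter.
resp = requests.get(
    'https://www.hl.co.uk/shares/search-for-investments',
    params={'stock_search_input': 'fgt'},
)
print(resp.url)          # final URL after any redirects
print(resp.status_code)  # the HTML of the results page is in resp.text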
This is a pretty general problem which you can solve with Google Chrome's dev tools. Basically:
1- Navigate to the page where you have the form and its fields. (Screenshot of the search form omitted.)
2- Choose the XHR tab under the Network tab, which filters the list down to Fetch and XHR requests. These requests are generally sent after a form submission and most of the time return JSON with the resulting data.
3- Make sure you enable the Preserve log checkbox at the top left so the list doesn't refresh when the form is submitted.
4- Submit the form; you'll see a bunch of requests being made. Inspect them to hopefully find what you're looking for.
In this case I found this URL endpoint which returns the results:
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see all the query parameters here, such as companyid and sectorid; what you need to do is change those and make a request to that URL. Then you'll get the relevant information.
To retrieve those companyid and sectorid values you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter the HTML to find the values. (Screenshot of the dropdown markup omitted.)
You can see the BS4 documentation for finding tags inside HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
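A minimal sketch of calling that endpoint directly (only a few parameters are varied; whether the server accepts a trimmed-down query string is an assumption):

import requests

params = {
    'companyid': 1324,
    'sectorid': 132,
    'start': 0,    # paging offset
    'rpp': 20,     # results per page
    'lo': 0,
    'sort': 'fd.full_description',
    'sort_dir': 'asc',
}
resp = requests.get('https://www.hl.co.uk/ajax/funds/fund-search/search', params=params)
print(resp.json())  # the fund results, assuming the endpoint returns JSON as observed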

How to get request headers in python scrapy

How do I get the request headers in Python Scrapy for dynamic table data?
The request is made when a button is clicked.
Is there is a way to get the data without using click simulation?
You can find the request using the browser's Network tab: observe which request is made when the button is clicked.
After that you can make the same request to the API in scrapy using the Request class.
Something like this:
yield scrapy.Request(
    "http://www.example.com/some_page.html",
    headers={'header-name': 'value'},  # headers is a dict, not a set
    callback=self.parse_table,
)
More info on Request in the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/request-response.html

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a script in Python using Scrapy in order to scrape data from a website behind authentication.
The page I'm scraping is really painful because it is mainly built with JavaScript and AJAX requests. The whole body of the page sits inside a <form> that allows changing the page via a submit button. The URL doesn't change (and it's a .aspx).
I have successfully scraped all the data I need from page one, then changed page by clicking this input button, using this code:
yield FormRequest.from_response(
    response,
    formname="Form",
    clickdata={"class": "PageNext"},
    callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over each container, click it, let the data display, scrape it, and only then move to the next page and repeat the process.
The thing is, I can't work out a process where the script clicks the container using Selenium (while staying logged in, otherwise I cannot reach this page) and Scrapy then scrapes the data once the XHR request has completed.
I did a lot of research on the internet but couldn't get any solution to work.
Thanks!
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see in the response body in Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of 25 fields); could it be that the server needs all the form fields, even the ones that don't change with the id?
Thanks!

web scraping python <span> with id

I want to scrape data in a <span> element on a given website using BeautifulSoup. You can see where it is located in the screenshot. However, the code I'm using just returns an empty list; I can't find the data I want in the list. What am I doing wrong?
from bs4 import BeautifulSoup
import urllib.request

url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')

your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)

for line in soup.findAll('span'):
    your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.
The dashboard in the screenshot looks to me like something JavaScript would generate. If you can't find the tag in the page source, that means it was added later by some JavaScript code, or your browser tried to fix some HTML which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves the plain HTML back. A browser would parse that HTML and execute any JavaScript code it finds; in your case, Beautiful Soup and urllib don't execute any JavaScript. urllib fetches the HTML, and Beautiful Soup makes it easier to parse and extract the relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render the page and only then parse its HTML with Beautiful Soup or any other parser.
Give Selenium a try: http://selenium-python.readthedocs.io/
You can control your own browser programmatically: make it request the page for you, render it, save the resulting HTML in a variable, parse it with Beautiful Soup, and extract the values you're interested in. Selenium also has its own element-finding API which you can use directly to search for that tag (a sketch follows below).
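A minimal sketch of that route, using the id from the question (how long the dashboard takes to populate is an assumption, hence the explicit wait):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://144.122.167.229')

# Wait until the JavaScript has filled in the legend value, then read it.
elem = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, 'mc1_legend_value'))
)
print(elem.text)
driver.quit()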
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at which requests are sent to the server in order to get that value. Press F12 to open the developer console and click on the Network tab. Refresh the page and you should see all the requests sent to the server along with their responses. Requests sent by JavaScript are usually XMLHttpRequests; click on XHR in the Network tab to filter out the other requests. (These are instructions for Google Chrome; Firefox might differ a bit.)
