I'm writing a Python script using Scrapy to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mostly built with JavaScript and AJAX requests. The whole body of the page sits inside a <form> that lets you change page using a submit button. The URL doesn't change (and it's an .aspx page).
I have successfully scraped all the data I need from page one, and I then move to the next page by clicking this input button with the following code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method is scraping the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over the containers, clicking each one to display its data and scraping it, and only then go to the next page and repeat the process.
The thing is, I can't work out how to have the script click on the container using Selenium (while logged in, otherwise I cannot reach this page) and then have Scrapy scrape the data once the XHR request has completed.
I did a lot of research on the internet but could not find a solution that I could apply.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see when I look at the request's Response in Chrome DevTools. I haven't included all the form data yet (about 10 of 25 fields); could it be that it needs all the form data, even the fields that don't change with the id?
Thanks !
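One way to test whether the missing fields matter is to replay the complete payload. Here is a minimal sketch, assuming the URL-encoded request body has been copied from the Network tab in Chrome DevTools. The `raw_body` string and the `formdata_from_devtools` helper are hypothetical names for illustration, not part of Scrapy. It parses the body into a dict that could be passed as `formdata`, overriding only the values that change per item.

```python
from urllib.parse import parse_qsl


def formdata_from_devtools(raw_body, overrides=None):
    """Turn a URL-encoded request body into a dict of form fields."""
    # keep_blank_values=True keeps fields that were sent with an empty value
    fields = dict(parse_qsl(raw_body, keep_blank_values=True))
    fields.update(overrides or {})
    return fields


# Hypothetical body copied from DevTools; the real fields and values will differ.
raw_body = "param1=a&param2=&__VIEWSTATE=%2FwEPDw&__ASYNCPOST=true&DetailsId=123"
formdata = formdata_from_devtools(raw_body, {"DetailsId": "456"})
print(formdata)
```

Values such as `__VIEWSTATE` usually change between responses, so they are probably better taken from the current page than from a saved capture.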
So I am trying to scrape with the following URL:
Website
The page has some hidden text that unlocks after a click.
Their HTML code is also hidden and unhides after button clicks.
Before click:
After click:
How can I scrape this text?
BeautifulSoup doesn't work on this text.
If you open dev tools and click those buttons, you can see that you make a post request to https://en.indonetwork.co.id/ajax.
So you can either try to replicate that - see if you can capture the payload sent in the post request from a scrape of the home page and send that.
Or you could use selenium to load the page, click the button, and then capture the data.
It does not work with BeautifulSoup because it is not a static site. When you click the phone button, the page sends a request to an API endpoint and then renders the response. You can check this in the Network tab of the dev tools (I confirmed this).
BeautifulSoup only sees the initial static HTML returned by the request. It does not take into account requests triggered by user interaction.
The solution to this is Selenium.
Here are the steps you can follow to get this done.
Start Selenium with a headful (non-headless) browser, which lets you interact with the web page easily.
Find the phone button and click on it.
Wait until the request has been processed and the result has been rendered on screen.
Then you can grab the content of the element you need.
A not-so-good solution
You can send the request directly to that same API endpoint. But there may be security barriers, such as CORS, to get past.
This is not a good solution because the API endpoint might change, or, since this API call returns phone numbers, the site might lock it down further in future. The interaction on the web page, by contrast, stays nearly the same.
You don't need scraping, there is an AJAX call happening under the hood.
import re

import requests
from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get('https://en.indonetwork.co.id/company/surrama-groups').text, 'html.parser')
# The phone button carries the values the AJAX call expects in its data-* attributes
v = soup.find(class_='btn btn-contact btn-contact-phone btn-contact-ajax').attrs
data_id = v['data-id']
data_text = v['data-text']
data_type = v['data-type']
# Replay the POST request the page makes when the button is clicked
data = requests.post('https://en.indonetwork.co.id/ajax', json={
    'id': data_id,
    'text': data_text,
    'type': data_type,
    'url': "leads/ajax"
}).json()
# Pull the digit runs (the phone numbers) out of the returned text
mobile_no = re.findall(r'(\d+)', data['text'])
print(mobile_no) #['622122520556', '6287885720277']
Good evening,
I'm trying to write a programme that extracts the sell price of certain stocks and shares from a website called hl.co.uk.
As you can imagine, you have to search for the stock whose sell price you want to see.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a POST request to this website, using the data I have in "post_data", to that specific form, so that it searches for the stock/share I want and gives me the next page?
Thanks in advance
Actually, when you submit the form from the homepage, it redirects you to the target page with a URL that looks like this: "https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW". So in my opinion, instead of submitting the homepage form, you should call the target page directly with your own GET parameters. The URL you're supposed to call will look like this: https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords].
Hope this helped you
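As a rough sketch of that idea, the search URL can be built with the standard library so the keywords are escaped properly. The `build_search_url` helper is a hypothetical name; the base URL and the `stock_search_input` parameter are taken from the URL above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.hl.co.uk/shares/search-for-investments"


def build_search_url(keywords):
    """Return the search results URL for the given keywords."""
    # urlencode escapes spaces and special characters in the keywords
    return SEARCH_URL + "?" + urlencode({"stock_search_input": keywords})


url = build_search_url("fgt")
print(url)
```

The resulting URL can then be fetched with `requests.get(url)` and parsed the same way as in the question.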
This is a fairly general problem that you can solve with Google Chrome's DevTools. Basically:
1- Navigate to the page that has the form and its fields.
In your case page should look like this:
2- Then choose the XHR filter under the Network tab, which shows only Fetch and XHR requests. These requests are generally sent after a form submission, and most of the time they return JSON with the resulting data.
3- Make sure you tick the Preserve log checkbox at the top left so the list isn't cleared when the form is submitted.
4- Submit the form; you'll then see a bunch of requests being made. Inspect them to find the one you're looking for.
In this case I found this URL endpoint, which returns the results as its response.
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see all the query parameters here, such as companyid and sectorid. All you need to do is change those and make a request to the URL; you'll then get the relevant information.
To retrieve the companyid and sectorid values, you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter the HTML to find the values shown in the screenshot below:
See the BS4 documentation on finding tags inside HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
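Changing the query parameters can be sketched with the standard library as below. The `with_params` helper is a hypothetical name, the endpoint URL is shortened here for readability, the company id 999 is made up, and treating `start` as a result offset is an assumption the endpoint itself does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url, **changes):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters such as "investment="
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({key: str(value) for key, value in changes.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened version of the endpoint above; the real URL has more parameters.
base = ("https://www.hl.co.uk/ajax/funds/fund-search/search"
        "?investment=&companyid=1324&sectorid=132&start=0&rpp=20")
next_url = with_params(base, companyid=999, start=20)
print(next_url)
```

Each generated URL can then be requested like any other page, for example with `requests.get(next_url)`.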
Hope everyone is safe and sound,
I am currently learning Scrapy and decided to try scraping a website (Glassdoor) that requires a login.
I am stuck and wonder if anyone could check what I have done so far and give me a hand?
1) I loaded the Glassdoor login page and opened the inspect tool (in Chrome).
2) I selected the Network section and entered my login details on the page. Once logged in, I looked for the login_input.htm entry with a 302 status (POST). After selecting it I went to the Headers section, but I cannot find the Form Data section, so I do not have all the information I need to add to my code.
I tried a lot of online resources but cannot find a solution to this.
I have also included the code I started working with:
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class GdSpider(scrapy.Spider):
    name = 'gd'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['glassdoor.co.uk']
    # start URLs need exactly one scheme
    start_urls = ['https://www.glassdoor.co.uk/profile/login_input.htm']

    def parse(self, response):
        # Fill in the login form found in the response and submit it
        return FormRequest.from_response(response,
                                         formdata={'password': 'mypassword',
                                                   'username': 'myusername'},
                                         callback=self.scrape_pages)

    def scrape_pages(self, response):
        # Open the post-login response in a browser to inspect it
        open_in_browser(response)
Could anyone let me know what I did wrong please?
Thank you,
Arnaud
Glassdoor's login is a JavaScript-rendered popup. If you disable JS, you will see that nothing renders when you click the Sign In link or open the link you have given.
This seems to be what you are looking for:
https://www.glassdoor.com/profile/ajax/loginAjax.htm
When you open the Sign In popup and try to log in using any credentials (they can be wrong, it does not matter), you will see loginAjax.htm appear in the Network tab. It has a form that sends the credentials by POST to the link I posted above.
Unfortunately it also sends a token with the credentials, so using this to log in might prove difficult.
For sending data you can use _urlencode (from scrapy.http.request.form import _urlencode) like this:
inputs = [("key", "value"),]
body = _urlencode(inputs, response.encoding)
and send the body via POST to the above URL, building a normal Scrapy Request (inputs has to be a list of tuples).
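Since `_urlencode` is a private helper (note the leading underscore), it may change between Scrapy versions. Here is a minimal sketch of building a similar form body with the standard library instead. The `encode_form` helper and the credentials are hypothetical, and the token mentioned above is not handled here.

```python
from urllib.parse import urlencode


def encode_form(inputs, encoding="utf-8"):
    """Encode a list of (key, value) tuples as a form-urlencoded body."""
    # urlencode percent-escapes special characters and turns spaces into "+"
    return urlencode(inputs, encoding=encoding).encode("ascii")


inputs = [("username", "myusername"), ("password", "my pass@word")]
body = encode_form(inputs)
print(body)
```

When sending such a body by POST, the request would normally also carry a `Content-Type: application/x-www-form-urlencoded` header.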
I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically, I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
#do some thing
But when I try to crawl the second page with "#2" in the URL
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value as for the first page:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know whether #2 is causing trouble when making the request for the second page.
I also googled the issue but could not find a fix.
What is the right way to make a request to a URL with # values? How can I address this issue? Please advise.
"#2" is a fragment identifier; it is not sent to the server. The HTML content you get when opening "http://someurl.com/page#123" is the same as the content for "http://someurl.com/page".
In the browser you see the second page because the page's JavaScript reads the fragment identifier, makes an AJAX request and injects the new content into the page. You should find the AJAX request's URL and use it:
It looks like our URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
It is easy to see that all we need is to change the "pg" parameter value to get the other pages.
You need to request the URL in the href attribute of the anchor tags that describe the pagination, at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find that the first page's URL looks like:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's url is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The anchor tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request url.
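A small sketch of both points, using only the standard library: `urldefrag` shows what is left of the URL once the fragment is dropped, and a hypothetical `page_url` helper builds the per-page URLs from the href pattern shown above. Whether the site still serves that pattern is an assumption.

```python
from urllib.parse import urldefrag

# Pattern taken from the pagination hrefs; {page} is the page number.
PAGE_URL = ("http://www.amazon.in/gp/bestsellers/books/"
            "ref=zg_bs_books_pg_{page}?ie=UTF8&pg={page}")


def page_url(page):
    """Return the URL of the given results page."""
    return PAGE_URL.format(page=page)


# The fragment is split off here; it is the part that never reaches the server.
requested, fragment = urldefrag(
    "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2"
)
print(requested, fragment)
print(page_url(2))
```

Looping over `page_url(n)` with `requests.get` should then return different content for each page, unlike the "#2" URL.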
I am trying to fetch the HTML content of a website using urllib2. The site has a body onload event that submits a form on the page, so it goes to a destination site and renders the details I need.
response = urllib2.urlopen('www.xyz.com?var=999-999')
www.xyz.com contains a form that is posted to "www.abc.com", this
action value varies depending upon the content in url 'var=999-999'
which means action value will change if the var value changes to
'888-888'
response.read()
this still gives me the html content of "www.xyz.com" , but I want
that of resulting action url. Any suggestions of fetching the html
content from the final page?
Thanks in advance
You have to figure out the call to that second page, including the parameters sent, so you can make that call yourself from your Python code. The best way is to navigate to the first page with the Google Chrome inspector open, then go to the Network tab, where the POST call will be captured and you can see the parameters sent. Then just recreate that same POST call with urllib2.
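Recreating the POST could look roughly like the sketch below. It assumes Python 3, where urllib2's functionality lives in `urllib.request`, and it assumes the form's action and fields are present in the first page's HTML. The `FormParser` and `build_post` names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormParser(HTMLParser):
    """Collect the first form's action and every named input on the page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
        elif tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_post(page_url, html):
    """Build the POST request that the page's form would submit."""
    parser = FormParser()
    parser.feed(html)
    # Resolve a relative action against the URL of the page itself
    target = urljoin(page_url, parser.action or "")
    body = urlencode(parser.fields).encode("ascii")
    return Request(target, data=body)


# Hypothetical first page; the real markup will differ.
html = ('<body onload="document.forms[0].submit()">'
        '<form action="http://www.abc.com/result" method="post">'
        '<input type="hidden" name="var" value="999-999"></form></body>')
request = build_post("http://www.xyz.com/?var=999-999", html)
print(request.get_method(), request.full_url, request.data)
```

Passing `request` to `urllib.request.urlopen` would send it and return the final page's response.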