Why does href change after downloading page - python

I'm writing a web parser and some hrefs are driving me crazy:
import urllib.request

resp = urllib.request.urlopen("http://portogruaro.trasparenza-valutazione-merito.it/storico-atti")
page = resp.read().decode('utf-8')
print(page)
I found this in the downloaded page:
<a.. href="http://portogruaro.trasparenza-valutazione-merito.it/storico-atti;jsessionid=BE0A764D125947680F3DC6F85760302A?p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=downloadAllegato&p_p_cacheability=cacheLevelPage&p_p_col_id=column-1&p_p_col_count=1&_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=oMrkWCwhyKWGcD67RyUPTMNzDbwk8ufAwUFVQ2_3Z4045lXXp1gcrKnaH7my84lD0jmgn_na5l1a5KnBtXxYtJYH7rbRP4GRdD53nB0MaBJSV6Ub1JDNoMnspbc2nmqr7a3ucdsOOBOUc4q0uTPd1Dg5ba1VE8DJ1kpf6C0eliencVxLYM8jPqxcSVokmrAjHqkHg4K3CFGZP9tGpCBTPQ"><i class="icon-download"></i> Allegato</a>
The href in the same anchor, as seen when retrieving the same URL with a browser, is:
"http://portogruaro.trasparenza-valutazione-merito.it/storico-atti?p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=downloadAllegato&p_p_cacheability=cacheLevelPage&p_p_col_id=column-1&p_p_col_count=1&_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=HAxoH6d7h0JNRoKoi9sl4R-tsWdtMVoLeeZ8dU5rUQL74MQNMpCnqmBwxX4uNCXuMk4Clb6EzvrIaUXNY0G4q9YGlmebpMDTrR3255v6bLGOiIWVwvbnKiaOoapsGBqwP4JPIUN1R9G8ajAnurCaqTknyMJkVLiKaw0Z4wI61pgAzqjSGHatViGIGIXkrV7IN6EduMl29vAARMvaHhEJ5g"
The ;jsessionid is added because the bot doesn't manage cookies, but it's not the only change... why?
EDIT: Maybe a particular session number triggers a specific action?
If you download the web page, the downloaded href won't work if you click on it, but clicking the href you see in the browser's page source (view-source: link) does work.

Hmm... apart from the ticket number and the jsessionid token, those are the same URL.
The parameters are not in the same order. But as far as I can tell, that doesn't change anything. Compare:
_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=oMrkWCwhyKWGcD67RyUPTMNzDbwk8ufAwUFVQ2_3Z4045lXXp1gcrKnaH7my84lD0jmgn_na5l1a5KnBtXxYtJYH7rbRP4GRdD53nB0MaBJSV6Ub1JDNoMnspbc2nmqr7a3ucdsOOBOUc4q0uTPd1Dg5ba1VE8DJ1kpf6C0eliencVxLYM8jPqxcSVokmrAjHqkHg4K3CFGZP9tGpCBTPQ
p_p_cacheability=cacheLevelPage
p_p_col_count=1
p_p_col_id=column-1
p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet
p_p_lifecycle=2
p_p_mode=view
p_p_resource_id=downloadAllegato
p_p_state=normal
and
_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=HAxoH6d7h0JNRoKoi9sl4R-tsWdtMVoLeeZ8dU5rUQL74MQNMpCnqmBwxX4uNCXuMk4Clb6EzvrIaUXNY0G4q9YGlmebpMDTrR3255v6bLGOiIWVwvbnKiaOoapsGBqwP4JPIUN1R9G8ajAnurCaqTknyMJkVLiKaw0Z4wI61pgAzqjSGHatViGIGIXkrV7IN6EduMl29vAARMvaHhEJ5g
p_p_cacheability=cacheLevelPage
p_p_col_count=1
p_p_col_id=column-1
p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet
p_p_lifecycle=2
p_p_mode=view
p_p_resource_id=downloadAllegato
p_p_state=normal
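For what it's worth, here is a minimal sketch of doing that comparison in code, assuming the two hrefs are stored in the hypothetical variables url_bot and url_browser:

from urllib.parse import parse_qs, urlsplit

def normalized(url):
    # Strip the ;jsessionid path parameter and parse the query string into a
    # dict, so that parameter order no longer matters.
    parts = urlsplit(url)
    return parts.path.split(';')[0], parse_qs(parts.query)

# url_bot and url_browser are hypothetical names for the two hrefs above.
path_bot, params_bot = normalized(url_bot)
path_browser, params_browser = normalized(url_browser)

print(path_bot == path_browser)  # True: same resource
keys = params_bot.keys() | params_browser.keys()
print({k for k in keys if params_bot.get(k) != params_browser.get(k)})
# Only the ..._downloadTicket parameter should show up as different.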

Related

Scraping text that unhides on Button click with beautifulsoup and python?

So I am trying to scrape the following URL:
https://en.indonetwork.co.id/company/surrama-groups
The page has some hidden text that unlocks after a click.
The corresponding HTML is also hidden and only appears after the button click (the question included before-and-after screenshots).
How can I scrape this text?
BeautifulSoup doesn't work on this text.
If you open dev tools and click those buttons, you can see that a POST request is made to https://en.indonetwork.co.id/ajax.
So you can either try to replicate that - see if you can capture the payload sent in the POST request from a scrape of the page and send it yourself.
Or you could use Selenium to load the page, click the button, and then capture the data.
It is not working with BeautifulSoup because it is not a static site. When you click the phone button, the page sends a request to an API endpoint and then renders the response. You can check this in the Network tab of dev tools (I confirmed this).
BeautifulSoup only retrieves the initial static HTML from the request. It does not take into account requests triggered by user interaction.
The solution to this is Selenium.
Here are the exact steps you can follow to get this done (a sketch follows the list):
1. Load Selenium with a headful browser (a headful browser lets you interact with the web page easily).
2. Find the phone button and click on it.
3. Wait until the request has been processed and the result has been rendered on screen.
4. Grab the content of the element as per your requirement.
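A minimal sketch of those steps, assuming Selenium with Chrome; the CSS class is borrowed from the button's markup shown in the other answer, so treat it as an assumption:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # headful browser, so you can watch the interaction
driver.get('https://en.indonetwork.co.id/company/surrama-groups')

# Find the phone button and click it.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-contact-phone')))
button.click()

# Crude wait for the AJAX response to render; an explicit wait on the
# element's text changing would be more robust.
time.sleep(3)

print(driver.find_element(By.CSS_SELECTOR, '.btn-contact-phone').text)
driver.quit()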
A less robust solution
You could directly send the request to that exact same API endpoint. But there may be security barriers, such as CORS, to get around.
This is not a good solution because the API endpoint might change, and since this call returns a phone number they may lock it down further in the future. The interaction on the web page, by contrast, is likely to stay the same.
You don't need browser automation for this; there is an AJAX call happening under the hood:
import re

import requests
from bs4 import BeautifulSoup

# Scrape the contact button's data-* attributes from the static page.
soup = BeautifulSoup(requests.get('https://en.indonetwork.co.id/company/surrama-groups').text, 'html.parser')
v = soup.find(class_='btn btn-contact btn-contact-phone btn-contact-ajax').attrs
data_id = v['data-id']
data_text = v['data-text']
data_type = v['data-type']

# Replay the AJAX call the button would make.
data = requests.post('https://en.indonetwork.co.id/ajax', json={
    'id': data_id,
    'text': data_text,
    'type': data_type,
    'url': "leads/ajax"
}).json()
mobile_no = re.findall(r'(\d+)', data['text'])
print(mobile_no)  # ['622122520556', '6287885720277']

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a Python script using Scrapy to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it's mainly built with JavaScript and AJAX requests. The whole body of the page sits inside a <form> that lets you change pages via a submit button, and the URL doesn't change (it's a .aspx page).
I have successfully scraped all the data I need from page one, then moved to the next page by clicking this input button with the following code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method scrapes the data.
However, I also need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over each container, click it to display the data, scrape it, and only then move to the next page and repeat the process.
The thing is, I can't figure out how to have the script click on a container using Selenium (while staying logged in, otherwise I can't reach this page) and then have Scrapy scrape the data once the XHR request has completed.
I did a lot of research on the internet but couldn't find a working solution.
Thanks!
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, it is not exactly the same as the response I see in Chrome DevTools. I haven't taken all the form fields into account yet (~10 of 25); could it be that the server needs all of them, even the ones that don't change with the id?
Thanks !

web scraping python <span> with id

I want to scrape the data in a <span> element on a given website using BeautifulSoup. You can see its location in the screenshot. However, the code I'm using just returns an empty list; I can't find the data I want in it. What am I doing wrong?
import urllib.request

from bs4 import BeautifulSoup

url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()

soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.findAll('span'):
    your_data.append(line.text)
Screenshot: https://imgur.com/a/z0vNh
Thank you.
The dashboard in the screenshot looks to me like something JavaScript would generate. If you can't find the tag in the page source, that means it was added later by some JavaScript code, or your browser tried to fix some HTML it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves the plain HTML back. A browser would parse the HTML and execute any JavaScript code it finds. In your case, Beautiful Soup and urllib don't execute any JavaScript: urllib fetches the HTML, and Beautiful Soup makes it easier to parse and extract relevant information.
If you want the value from that tag, I recommend using a headless browser to render the page, and only then parsing its HTML with Beautiful Soup or any other parser.
Give Selenium a try: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically: make it request the page for you, render it, save the resulting HTML in a variable, parse it with Beautiful Soup, and extract the values you're interested in. I believe Selenium also has its own selectors, which you can use directly to search for that tag.
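A minimal sketch of that approach, assuming Selenium with Chrome (the span id comes from the question's code):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://144.122.167.229')
# Unlike urllib's raw response, page_source reflects the DOM after the
# browser has executed the page's JavaScript.
soup = BeautifulSoup(driver.page_source, 'html.parser')
span = soup.find('span', attrs={'id': 'mc1_legend_value'})
print(span.text if span else 'span not rendered yet - may need an explicit wait')
driver.quit()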
Or maybe even try scrapinghub's Splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at what requests are sent to the server to get that value. Open the developer console (F12), click on Network, refresh the page, and you should see all the requests sent to the server along with their responses. Requests sent by JavaScript are usually XMLHttpRequests; click on XHR in the Network tab to filter out the other requests. (These are instructions for Google Chrome; Firefox might differ a bit.)

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value, equivalent to that of page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know whether the #2 is causing trouble when requesting the second page.
I also googled the issue but could not find a fix.
What is the right way to request a URL with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
Looks like our url is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
From this it's easy to see that all we need is to change the "pg" parameter value to get the other pages.
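A minimal sketch, assuming the pg parameter (and matching ref segment) works as shown:

import requests

base = 'http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{0}?ie=UTF8&pg={0}'
for page in (1, 2):
    r = requests.get(base.format(page))
    print(page, r.status_code, len(r.text))  # the two pages should now differ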
You need to request the URL found in the href attribute of the anchor tags that make up the pagination, at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is like:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The a tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you just need to change the request URL.
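For example, you could take the next page's URL straight from the pagination anchor (a sketch; the page attribute is taken from the tag above):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers')
soup = BeautifulSoup(r.text, 'html.parser')
next_link = soup.find('a', attrs={'page': '2'})  # pagination anchors carry a 'page' attribute
if next_link is not None:
    r2 = requests.get(next_link['href'])  # fetch items 21-40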

How to get links from "Inspect element" on page in python?

I need to get the video link from a web page. I click on Inspect Element and go to the Network tab, and I see a link I need to get... But how can I access this link through Python?
this is the situation:
http://i.imgur.com/DS811BW.jpg?1
the link is positioned in the header:
http://i.imgur.com/5C2vKje.jpg
I only need the link; I don't need to download the video.
What would be the best way to go? Maybe Selenium?
Selenium will work, yes. What you'll want to do is find the element in the DOM that's pulling the video in. Before you go that route, though, you should try to figure out manually which element you're after. You're probably after a video tag and its child source tag.
HTML 5 video tag docs: http://www.w3schools.com/tags/tag_video.asp
Selenium selector docs: https://selenium-python.readthedocs.org/locating-elements.html
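A minimal sketch of that approach, assuming the page uses a <video> tag (the URL and tag structure are assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-video')  # hypothetical URL

# The src may sit on the <video> tag itself or on a child <source> tag.
video = driver.find_element(By.TAG_NAME, 'video')
src = video.get_attribute('src')
if not src:
    src = video.find_element(By.TAG_NAME, 'source').get_attribute('src')
print(src)
driver.quit()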
You just need to make an HTTP request to get the page and then go through the response to find the URL. Define the XPath and use lxml to extract it. Something like this (it is just an example and will probably need adapting):
import lxml.html as parser
import requests

# do_request() fetches the page, returning its HTML or None on failure.
def do_request(url):
    r = requests.get(url)
    return r.text if r.status_code == 200 else None

path = '<define the XPath>'  # placeholder: the XPath to the link you want
url = '<your url>'           # placeholder: the page to fetch
data = do_request(url)
if data:
    doc = parser.fromstring(data)
    url_res = doc.xpath(path)  # the url(s) extracted from the web page
